Abstract: Motivated by questions about dense (non-sparse) signals in high-dimensional
data analysis, we study the unconditional out-of-sample prediction error (predictive risk)
associated with three popular linear estimators for high-dimensional linear models: ridge
regression estimators, scalar multiples of the ordinary least squares (OLS) estimator
(referred to as James-Stein shrinkage estimators), and marginal regression estimators. The
results in this paper require no assumptions about sparsity and imply: (i) if prior
information about the population predictor covariance is available, then the ridge estimator
outperforms the OLS, James-Stein, and marginal estimators; (ii) if little is known about
the population predictor covariance, then the James-Stein estimator may be an effective
alternative to the ridge estimator; and (iii) the marginal estimator has serious
deficiencies for out-of-sample prediction. Both finite sample and asymptotic properties of the
estimators are studied in this paper. Though various asymptotic regimes are considered,
we focus on the setting where the number of predictors is roughly proportional
to the number of observations. Ultimately, the results presented here provide new and
detailed practical guidance regarding several well-known non-sparse methods for
high-dimensional linear models.
1. Introduction

High-dimensional data analysis is one of the most active areas of current statistical research.
Much of this research has been driven by technological advances across a variety of scientific
disciplines, including molecular biology and genomics, that have enabled investigators
to collect vast datasets with relative ease. The linear model has played a prominent role in
recent literature on high-dimensional data analysis. In the linear model, observed outcomes
$y_1, \dots, y_n \in \mathbb{R}$ and corresponding $d$-dimensional predictors
$x_1, \dots, x_n \in \mathbb{R}^d$ are related via the equation

$$y_i = x_i^T\beta + \epsilon_i, \quad 1 \le i \le n, \tag{1}$$

where $\beta = (\beta_1, \dots, \beta_d)^T \in \mathbb{R}^d$ is an unknown parameter vector, and
$\epsilon_1, \dots, \epsilon_n$ are unobserved iid error terms with mean 0 and variance
$\sigma^2 > 0$. To simplify notation, let $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$,
$X = (x_1, \dots, x_n)^T$, and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$. Then the observed
data are $(y, X)$ and (1) may be rewritten as $y = X\beta + \epsilon$.

In the out-of-sample prediction problem for the linear model, the goal is to find rules for
predicting unobserved future outcomes, $y_{new} = x_{new}^T\beta + \epsilon_{new}$, given the
associated predictor vector $x_{new}$ and the data $(y, X)$. In the formulation considered here,
a prediction rule is determined by an estimator $\hat\beta$ for $\beta$ and the performance of the
prediction rule is closely tied to properties of $\hat\beta$. The "usual" estimator for $\beta$ is
the ordinary least squares (OLS) estimator $\hat\beta_{ols} = (X^TX)^{-1}X^Ty$. However, the OLS
estimator has drawbacks, such as instability, that are especially significant in
high-dimensional data analysis, when the number of predictors $d$ is large.
Furthermore, if $d > n$, then $X^TX$ is not invertible and the OLS estimator is undefined (though
this issue may be partially sidestepped by considering pseudoinverses, as is done below). Thus,
alternatives to the OLS estimator are desirable.

Much of the recent research on high-dimensional linear models and alternatives to the
OLS estimator has focused on sparsity. In this research, sparsity plays at least two roles: (i)
sparse estimators for $\beta$ are often convenient, as they may aid interpretation (Fan and Li, 2006;
Tibshirani, 1996) and (ii) if $\beta$ is sparse in an appropriate sense, then this can often be
leveraged to develop methods that perform very well, even with extremely high-dimensional datasets
(Bickel et al., 2009; Bunea et al., 2007; Candes and Tao, 2007; Fan and Lv, 2011; Raskutti
et al., 2011; Rigollet and Tsybakov, 2011; Ye and Zhang, 2010; Zhang, 2010). This provides a
promising framework, which ideally yields interpretable estimators that perform well in
high-dimensional data analysis. However, several recent papers in genomics and statistics have
questioned the degree of sparsity in modern genomic datasets (see, for instance, (Hall et al.,
2009), and the references contained therein, including (Goldstein, 2009; Hirschhorn, 2009;
Kraft and Hunter, 2009), and, more recently, (Bansal et al., 2010; Manolio, 2010)). This
suggests that a closer study of non-sparse (or "dense") methods for high-dimensional linear
models may prove useful.

This paper contains a careful analysis of three non-sparse linear estimators that are
alternatives to the OLS estimator: ridge regression estimators, a class of James-Stein type
estimators (scalar multiples of the OLS estimator), and marginal regression estimators. We study
the unconditional out-of-sample prediction error (predictive risk) associated with these
estimators in a high-dimensional setting where the data are drawn from a multivariate normal
distribution. Though all of these estimators have been studied extensively in the past, the
results in this paper offer unique and detailed insights into their comparative performance in
high-dimensional data analysis, along with practical guidance for implementation and tuning
parameter selection. Symmetry properties of the estimators are easily leveraged in our
formulation of the problem, which leads to many of the new insights delivered here. No sparsity
assumptions are made throughout the paper. Though a direct comparative analysis of the estimators
is emphasized, we also identify minimax ridge and James-Stein estimators (over the entire
parameter space). Broader optimality properties of non-sparse estimators are studied in (Dicker,
2012).

Ultimately, the results in this paper have significant practical implications for
high-dimensional linear models when little is known about the sparsity of the underlying signal,
which may be partially summarized as follows: (i) if $\mathrm{Cov}(x_i)$ is known or if a
norm-consistent estimator is available, then the ridge estimator outperforms the James-Stein,
OLS, and marginal estimators (in fact, results in (Dicker, 2012) imply that the ridge estimator
is nearly optimal for out-of-sample prediction in the described setting); (ii) if little is
known about $\mathrm{Cov}(x_i)$ or if $d/n$ is small, then the James-Stein estimator may be an
effective alternative to the ridge and OLS estimators; and (iii) the marginal estimator has
serious deficiencies for out-of-sample prediction.

2. Preliminaries: Definitions, notation, and an overview of results 
2.1. Out-of-sample prediction 

Each estimator, $\hat\beta = \hat\beta(y, X)$, of $\beta$ determines a linear prediction rule,
$\hat y(x) = x^T\hat\beta$. We define the unconditional out-of-sample prediction error
(predictive risk) of $\hat\beta$ to be

$$E\{y_{new} - \hat y(x_{new})\}^2 = E(y_{new} - x_{new}^T\hat\beta)^2, \tag{2}$$

where $(y_{new}, x_{new})$ is independent of $(y, X)$ and drawn from the same data-generating
mechanism as $(y_i, x_i)$, and the expectation in (2) is taken over $(y_{new}, x_{new})$ and
$(y, X)$. The goal of the unconditional out-of-sample prediction problem is to minimize (2) over
estimators $\hat\beta$. In order to evaluate (2), the distribution of $\epsilon_i$ and $x_i$ must
be specified. We assume that

$$x_1, \dots, x_n \sim N(0, \Sigma) \text{ and } \epsilon_1, \dots, \epsilon_n \sim N(0, \sigma^2) \text{ are independent}, \tag{3}$$

where $\Sigma$ is a $d \times d$ positive definite matrix and $\sigma^2 > 0$. These distributional
assumptions are restrictive. However, other authors studying predictive risk have made similar
assumptions (Baranchik, 1973; Breiman and Freedman, 1983; Brown, 1990; Leeb, 2009; Stein, 1960)
and we believe that the insights imparted by the resulting simplifications are worthwhile.
The assumptions (3) imply that $E(x_i) = 0$ and $E(y_i) = 0$. In other words, the model
considered here does not have an intercept term. In many practical settings, it is more
appropriate to allow $E(y_i), E(x_i) \ne 0$. In fact, all of the methods studied in this paper
can accommodate data with an intercept term, provided one first follows the usual approach of
centering the data, and then decorrelating the observations (i.e., adjusting for the degree of
freedom lost upon centering the data).

Let $w_i^T = (y_i, x_i^T) \in \mathbb{R}^{d+1}$ and note that the assumption (3) is equivalent to
assuming

$$w_1, \dots, w_n \sim N(0, V), \quad \text{where } V = \begin{pmatrix} \beta^T\Sigma\beta + \sigma^2 & \beta^T\Sigma \\ \Sigma\beta & \Sigma \end{pmatrix} \in PD(d+1),$$

and $PD(d+1)$ is the collection of all $(d+1) \times (d+1)$ positive definite matrices. The
predictive risk (2) of an estimator $\hat\beta$ may be re-expressed as

$$E(y_{new} - x_{new}^T\hat\beta)^2 = E_V\{(\hat\beta - \beta)^T\Sigma(\hat\beta - \beta)\} + \sigma^2,$$

where the subscript $V$ in the expectation on the right-hand side above indicates that the
expectation is taken over $w_1, \dots, w_n \sim N(0, V)$. After standardizing by $\sigma^2$, the
predictive risk is equivalent to

$$R_V(\hat\beta) = \sigma^{-2}E_V\{(\hat\beta - \beta)^T\Sigma(\hat\beta - \beta)\}.$$

In fact, $R_V(\hat\beta)$ is the primary object of study in the sequel and we will typically
refer to $R_V(\hat\beta)$ itself as the predictive risk (or out-of-sample prediction error) of
$\hat\beta$. Note that the predictive risk $R_V(\hat\beta)$ is completely determined by the
estimator $\hat\beta$ and the positive definite matrix $V \in PD(d+1)$. We will often write
$E_\Sigma(\cdot)$ in place of $E_V(\cdot)$ when the expectation only involves the random matrix
$X$. Similarly, we write $P_V(\cdot)$ or $P_\Sigma(\cdot)$ when computing probabilities involving
$w_1, \dots, w_n$ or $X$, respectively.
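
The re-expression of (2) in terms of the quadratic form in $\hat\beta - \beta$ can be checked in a few lines. The following is a sketch of that step, using only the model (1), the independence of $(y_{new}, x_{new})$ from $(y, X)$, $E(\epsilon_{new}) = 0$, and $E(x_{new}x_{new}^T) = \Sigma$:

```latex
\begin{align*}
y_{new} - x_{new}^T\hat\beta
  &= x_{new}^T(\beta - \hat\beta) + \epsilon_{new}, \\
E(y_{new} - x_{new}^T\hat\beta)^2
  &= E\{(\hat\beta - \beta)^T x_{new}x_{new}^T(\hat\beta - \beta)\}
     + 2E\{\epsilon_{new}\,x_{new}^T(\beta - \hat\beta)\}
     + E(\epsilon_{new}^2) \\
  &= E_V\{(\hat\beta - \beta)^T\Sigma(\hat\beta - \beta)\} + 0 + \sigma^2 .
\end{align*}
```

The cross term vanishes because $\epsilon_{new}$ has mean zero and is independent of both $x_{new}$ and $\hat\beta$; the first term reduces to the $\Sigma$-weighted quadratic form after conditioning on $(y, X)$.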

2.2. The estimators 

For a matrix $A$, let $A^-$ denote its Moore-Penrose pseudoinverse. Below, we define the
estimators studied in this paper. All of the estimators have the form $\hat\beta = Ay$ for some
$d \times n$ matrix $A$.

OLS estimator. $\hat\beta_{ols} = (X^TX)^-X^Ty$.

James-Stein estimator. $\hat\beta_{js}(\lambda) = (1 + \lambda)^{-1}(X^TX)^-X^Ty$, $\lambda \ge 0$.

Ridge regression estimator. $\hat\beta_r(\lambda) = (X^TX + n\lambda\Sigma)^-X^Ty$, $\lambda \ge 0$.

Marginal regression estimator. $\hat\beta_m = n^{-1}\Sigma^{-1}X^Ty$.
The OLS estimator. This version of the OLS estimator is defined for all $d, n$ because it
utilizes the pseudoinverse $(X^TX)^-$.

The James-Stein estimator. A version of this estimator was proposed by Stein (1960). The
parameter $\lambda \ge 0$ is a tuning (or shrinkage) parameter that must be specified by the user.
Baranchik (1973) proved that for a certain data dependent $\lambda_{bar}$, the estimator
$\hat\beta_{bar} = \hat\beta_{js}(\lambda_{bar})$ has smaller predictive risk than the OLS
estimator (the estimator is discussed further in Section 8.2 below). We refer to
$\hat\beta_{js}(\lambda)$ as the James-Stein estimator because of its superficial
resemblance to the James-Stein estimator for the normal means problem (James and Stein,
1961). Notice that $\hat\beta_{js}(\lambda)$ is a scalar multiple of the OLS estimator and
$\hat\beta_{js}(0) = \hat\beta_{ols}$. The James-Stein estimator is a shrinkage estimator and
$\lambda$ determines the amount of shrinkage: $\|\hat\beta_{js}(\lambda)\|$ is decreasing in
$\lambda$, where $\|\cdot\|$ denotes the $\ell^2$-norm.

The ridge regression estimator. Many versions of the ridge estimator have been studied and
have been shown to outperform the OLS estimator in a variety of settings (Casella, 1980;
Golub et al., 1979; Hoerl and Kennard, 1970; Tikhonov, 1943). Perhaps the most common
version of the ridge estimator has the form $\hat\beta_0(\lambda) = (X^TX + n\lambda I)^{-1}X^Ty$.
The ridge estimator considered here, $\hat\beta_r(\lambda)$, has convenient symmetry properties
and can be derived from a class of generalized ridge estimators proposed by Casella (1980) when
considerations about out-of-sample prediction are taken into account. Notice that
$\hat\beta_r(\lambda)$ depends on the covariance matrix $\Sigma = \mathrm{Cov}(x_i)$. In practice,
if $\Sigma$ is not known then it may be feasible to replace $\Sigma$ with an estimate
$\hat\Sigma$ to obtain a modified ridge estimator. The effect of replacing $\Sigma$ with
$\hat\Sigma$ on prediction error is discussed in Section 8.1. If prior information about
$\Sigma$ is available, then this can be incorporated into $\hat\Sigma$, otherwise the sample
covariance $\hat\Sigma = n^{-1}X^TX$ may be used. If $\hat\Sigma = n^{-1}X^TX$ is used in place of
$\Sigma$, then the modified ridge estimator reduces to the James-Stein estimator,
$\hat\beta_{js}(\lambda)$. This was observed previously by Oman (1984). Like the James-Stein
estimator, the ridge estimator is a shrinkage estimator and $\lambda \ge 0$ is a shrinkage
parameter that must be specified by the user.

Marginal regression estimator. Variants of this estimator (that are often implemented with
$\Sigma = I$) are known to have desirable screening and variable selection properties and have
been used extensively for related applications (Fan and Lv, 2008). Like the ridge estimator,
the marginal estimator depends on the covariance matrix $\Sigma$. If $\Sigma$ is not known, then
it may be replaced with an estimate $\hat\Sigma$. Taking $\hat\Sigma = n^{-1}X^TX$ gives the OLS
estimator. It seems reasonable to also consider linear shrinkage estimators based on the
marginal estimator, such as $(1 + \lambda)^{-1}\hat\beta_m$, $\lambda \ge 0$; these estimators are
discussed further in Section 8.3.
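
The four definitions above, and the two plug-in reductions mentioned in the text (ridge with the sample covariance becomes James-Stein; marginal with the sample covariance becomes OLS), can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's code: it assumes $d < n$ so that $X^TX$ is invertible and an ordinary inverse can stand in for the pseudoinverse, and the helper names (`transpose`, `matmul`, `solve`, `estimators`) and the simulated data are hypothetical.

```python
import random

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A z = b (A square, assumed invertible) by Gauss-Jordan elimination."""
    k = len(A)
    M = [list(A[i]) + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[k] for row in M]

def estimators(X, y, Sigma, lam):
    """OLS, James-Stein, ridge, and marginal estimates (assumes d < n, X^T X invertible)."""
    n, d = len(X), len(X[0])
    Xt = transpose(X)
    G = matmul(Xt, X)                                          # X^T X
    Xty = [sum(x * v for x, v in zip(col, y)) for col in Xt]   # X^T y
    ols = solve(G, Xty)
    js = [b / (1.0 + lam) for b in ols]
    ridge_matrix = [[G[i][j] + n * lam * Sigma[i][j] for j in range(d)] for i in range(d)]
    ridge = solve(ridge_matrix, Xty)
    marginal = solve([[n * s for s in row] for row in Sigma], Xty)
    return {"ols": ols, "js": js, "ridge": ridge, "marginal": marginal}

random.seed(0)
n, d, lam = 30, 3, 0.5
beta = [1.0, -2.0, 0.5]                                        # hypothetical coefficients
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, beta)) + random.gauss(0.0, 1.0) for row in X]

# Plug the sample covariance n^{-1} X^T X in for Sigma.
G = matmul(transpose(X), X)
Sigma_hat = [[g / n for g in row] for row in G]
plug_in = estimators(X, y, Sigma_hat, lam)
print(plug_in["ridge"], plug_in["js"])       # agree up to rounding
print(plug_in["marginal"], plug_in["ols"])   # agree up to rounding
```

With the true $\Sigma$ supplied instead of `Sigma_hat`, the ridge and marginal estimates generally differ from the James-Stein and OLS estimates, which is the distinction the text draws.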
3. Overview of results

3.1. Symmetry properties

In addition to being linear, the estimators defined in the previous section have notable
symmetry properties. In particular, they are linearly equivariant and scale invariant. These
properties help simplify predictive risk calculations and are discussed in Section 4. More
fundamentally, we argue in Section 4 that these are natural properties for non-sparse estimators.

3.2. Finite sample results 

Both finite sample and asymptotic properties are studied in this paper. Finite sample results
are the subject of Section 5. In finite samples, we identify oracle ridge and James-Stein
estimators, $\hat\beta_r^* = \hat\beta_r(\lambda_r^*)$ and
$\hat\beta_{js}^* = \hat\beta_{js}(\lambda_{js}^*)$, that depend on the (typically unknown)
signal-to-noise ratio

$$\eta^2 = \beta^T\Sigma\beta/\sigma^2 \tag{5}$$

(Propositions 4-5). These estimators have the smallest predictive risk among ridge and
James-Stein estimators with non-random shrinkage parameters $\lambda \ge 0$. Simplified formulas
for the estimators' predictive risk are also obtained. Similar results have been obtained in other
settings, e.g. (Pinsker, 1980), (Goldenshluger and Tsybakov, 2003). The major novelty of our
results is their simplicity and their applicability to out-of-sample prediction. These results
provide the means for an initial comparative analysis of the estimators (Section 5.5). In
particular, we show that

$$R_V(\hat\beta_r^*) \le R_V(\hat\beta_{js}^*) \le \left\{\begin{array}{l} R_V(\hat\beta_{ols}) \\ R_V(\hat\beta_m). \end{array}\right. \tag{6}$$

To our knowledge, these are the first analytic results providing a direct comparison between
the predictive risk of ridge regression and James-Stein estimators. In Section 5.5 we also argue
that the marginal estimator has serious deficiencies for out-of-sample prediction.

3.3. Asymptotic results

In Sections 6-7, we study asymptotic properties of the estimators in high-dimensional settings.
This helps provide a better understanding of the estimators' performance in high-dimensional
settings. Asymptotic regimes where $d/n \to 0$, $d/n \to \rho \in (0, \infty)$, and
$d/n \to \infty$ are all considered.
We find that in order to ensure consistency (i.e. asymptotically vanishing predictive risk)
for any of the estimators considered here, one must have $d/n \to 0$. This is a common feature of
non-sparse estimators that can be framed more generally in terms of minimax problems over
highly symmetric parameter spaces (Donoho and Johnstone, 1994; Pinsker, 1980). Indeed, one
way to formulate dense estimation and prediction problems is in terms of minimax problems
over $\ell^2$-balls. Dicker (2012) proved that the minimax rate for out-of-sample prediction over
$\ell^2$-balls is proportional to $d/n$. Thus, the estimators considered here achieve the minimax
rate (this is not too noteworthy, as many estimators achieve the minimax rate for dense
estimation and prediction; however, Dicker (2012) also proved the stronger result that the ridge
estimator is asymptotically minimax over $\ell^2$-balls).

Though the estimators studied here require $d/n \to 0$ for consistency, we devote much of
our effort to studying asymptotic regimes where $d/n \to \rho > 0$. Our interest in these regimes
is motivated by the emergence of important problems in high-dimensional data analysis where
the role of sparsity is unclear (like those cited in Section 1 above), which highlight the
importance of characterizing and thoroughly understanding dense problems and estimators in
settings where $d/n$ is significantly larger than 0.

3.3.1. Related work: Sparse problems and ellipsoids 

In contrast with dense problems, in sparse problems it is known that consistent estimation and
prediction may be possible even if $d/n \to \infty$ (Bickel et al., 2009; Bunea et al., 2007;
Candes and Tao, 2007; Raskutti et al., 2011; Ye and Zhang, 2010). However, the required sparsity
conditions (e.g. $\ell^p$-sparsity, $0 < p < 2$ (Abramovich et al., 2006)) may not hold in general
and our motivating interest lies precisely in these situations. Other conditions on $\beta$ may
also allow for consistent estimation or prediction when $d$ is much larger than $n$; for
instance, if $\beta$ belongs to an $\ell^2$-ellipsoid with decaying axes,
$B(c, a) = \{\beta \in \mathbb{R}^d;\ a_1\beta_1^2 + \cdots + a_d\beta_d^2 \le c^2\}$, where
$c > 0$, $a = (a_1, \dots, a_d)^T \in \mathbb{R}^d$, and $0 < a_1 \le \cdots \le a_d$ (Cavalier and
Tsybakov, 2002; Goldenshluger and Tsybakov, 2001, 2003; Pinsker, 1980). In this direction,
Goldenshluger and Tsybakov's (2001, 2003) work may be most relevant to ours. They study
out-of-sample prediction with "blockwise" James-Stein estimators (which are, in a sense, a hybrid
of the ridge and James-Stein estimators considered here) and obtain adaptive asymptotic minimax
results over $\ell^2$-ellipsoids. Ultimately, these estimators leverage asymmetry in ellipsoidal
parameter spaces (e.g. rapidly increasing $a_j$) to obtain faster rates of convergence. In the
highly symmetric case that is most relevant to our results, where the parameter space is an
$\ell^2$-ball $B(c) = B(c, (1, \dots, 1)^T)$, Goldenshluger and Tsybakov's results require
$d/n \to 0$ to ensure consistency and do not apply if $d/n \to \rho > 0$. In general, ellipsoid
conditions are natural for many inverse problems in nonparametric function estimation, but they
may be overly restrictive in other settings, such as the genomic applications discussed in
Section 1.
3.3.2. Oracle estimators

Section 6 of this paper contains a detailed asymptotic analysis of the predictive risk of
$\hat\beta_{ols}$, $\hat\beta_m$, $\hat\beta_{js}^*$, and $\hat\beta_r^*$. We show that if
$d/n \to 0$, then $R_V(\hat\beta_r^*)$ and $R_V(\hat\beta_{js}^*)$ are asymptotically
equivalent; whether or not these estimators are asymptotically equivalent to the OLS estimator
depends on the magnitude of the signal-to-noise ratio (Proposition 9).

The regime where $d/n \to \rho \in (0, \infty)$ appears to be the natural setting for studying the
estimators considered in this paper. Using results from random matrix theory, e.g. (Marcenko
and Pastur, 1967), in Section 6 we obtain closed-form expressions for the asymptotic predictive
risk of $\hat\beta_r^*$, $\hat\beta_{js}^*$, $\hat\beta_{ols}$, and $\hat\beta_m$ as
$d/n \to \rho \in (0, \infty)$. These formulas are new and imply that each of
the estimators exhibits distinct behavior in this asymptotic regime. In particular, the benefits
of the ridge estimator over the James-Stein, OLS, and marginal estimators observed in finite
samples (6) persist when we pass to the limit. This contrasts with the case where $d/n \to 0$
and the ridge and James-Stein estimators are asymptotically equivalent.

Finally, if $d/n \to \infty$, then $\beta$ is non-estimable by any of the methods considered
here, in the sense that their asymptotic performance is no better than that of
$\hat\beta_{null} = 0$. In fact, results in (Dicker, 2012) imply that if $d/n \to \infty$, then
$\hat\beta_{null}$ is asymptotically minimax over $\ell^2$-balls.

3.3.3. Adaptive estimation 

Section 7 is concerned with adaptive estimation, when $d/n \to \rho \in (0, 1)$. The oracle
estimators $\hat\beta_r^*$ and $\hat\beta_{js}^*$ depend on the signal-to-noise ratio $\eta^2$,
which is typically unknown. We show that if $d/n \to \rho \in (0, 1)$, then $\eta^2$ may be
replaced with an estimate $\hat\eta^2$ and that the resulting estimators (which are called
adaptive estimators because they "adapt" to the signal-to-noise ratio) are asymptotically
equivalent to the oracle estimators. Note that this addresses the problem of tuning parameter
selection for the ridge and James-Stein estimators. A corollary of the main result in Section 7
(Corollary 2, Proposition 10) implies that the adaptive ridge and James-Stein estimators are
minimax over $V \in PD(d+1)$, provided $0 < \theta \le d/n \le \Theta < 1$ for some constants
$\theta, \Theta \in \mathbb{R}$ and $d, n$ are sufficiently large. The requirement $d < n$ for our
results on adaptive estimation is related to the fact that if $d \ge n$, then
$y = X\hat\beta_{ols}$ and the usual estimator for $\sigma^2$,
$\hat\sigma^2 = (n - d)^{-1}\|y - X\hat\beta_{ols}\|^2$, is undefined. It may be possible to
utilize other estimators for $\sigma^2$ in settings where $d \ge n$, but this is not pursued in
detail here.

3.4. Miscellanea

Some additional topics are discussed in Section 8. Recall that the ridge estimator depends
on the predictor covariance matrix $\mathrm{Cov}(x_i) = \Sigma$. In Proposition 11, we show that
if a norm-consistent estimator $\hat\Sigma$ of the predictor covariance $\Sigma$ is available,
then substituting $\hat\Sigma$ for $\Sigma$ does not affect the asymptotic predictive risk of the
ridge estimator when $d \asymp n$. We also show that a previously proposed minimax James-Stein
estimator (Baranchik's estimator $\hat\beta_{bar}$, introduced in Section 2.2) is sub-optimal in
terms of asymptotic predictive risk when compared with the oracle and adaptive James-Stein
estimators proposed in this paper (Proposition 12). Finally, we consider a shrinkage estimator
based on the marginal regression estimator and show that the ridge estimator outperforms this
estimator in terms of predictive risk. Section 9 contains a concluding discussion. Unless
explicitly stated otherwise, all propositions are proved in Appendix B.

4. Linear equivariance and scale invariance

In addition to being linear estimators, the estimators studied in this article share important 
symmetry properties. 

Definition 1. An estimator $\hat\beta = \hat\beta(y, X, \Sigma)$ is linearly equivariant if

$$A^{-1}\hat\beta(y, X, \Sigma) = \hat\beta(y, XA, A^T\Sigma A) \tag{7}$$

for all $d \times d$ invertible matrices $A$. It is scale invariant if

$$\hat\beta(y, X, \Sigma) = \hat\beta(ty, tX, t^2\Sigma)$$

for all scalars $t \in \mathbb{R} \setminus \{0\}$. If an estimator is both linearly equivariant
and scale invariant, we say that it is LS. □

An estimator is linearly equivariant if it is compatible with linear transformations of the
predictor basis; it is scale invariant if it is invariant under scaling of the data. Note that
estimators in Definition 1 are allowed to depend on the population predictor covariance $\Sigma$
and the compatibility criterion (7) implies that a linearly equivariant estimator's dependence
on $\Sigma$ must respect changes of basis. Speaking broadly, linearly equivariant estimators may
be appropriate in situations where there is little prior knowledge about the information carried
in the given predictor basis as it relates to the outcome. By contrast, sparsity assumptions
convey exactly this type of information and linear equivariance is less appropriate for sparse
signals. Notice that $\hat\beta_{ols}$ and $\hat\beta_m$ are LS; for any fixed
$\lambda \ge 0$, $\hat\beta_{js}(\lambda)$ and $\hat\beta_r(\lambda)$ are also LS.

The symmetry properties of the various classes of estimators described in Definition 1
lead to some useful simplifications in their predictive risk. Recall the signal-to-noise ratio
$\eta^2 = \beta^T\Sigma\beta/\sigma^2$.
(b) If $\hat\beta$ is LS, then $R_V(\hat\beta) = R_{V_0}(\hat\beta)$, where

$$V_0 = \begin{pmatrix} 1 + \eta^2 & \eta u^T \\ \eta u & I \end{pmatrix},$$

$\eta^2$ is the signal-to-noise ratio (5), and $u \in \mathbb{R}^d$ is any fixed unit vector. In
particular, $R_V(\hat\beta)$ depends on $\beta$, $\Sigma$, and $\sigma^2$ only through the
signal-to-noise ratio (and $d, n$). □

Proposition 1 implies that when computing the predictive risk of LS estimators, we may
assume without loss of generality that $\Sigma = I$, $\sigma^2 = 1$, and $\beta = \eta u$, for an
arbitrary fixed unit vector $u \in \mathbb{R}^d$. This is used repeatedly to prove the
propositions below. On the other hand, we do not make the blanket assumption that $\Sigma = I$,
$\sigma^2 = 1$, and $\beta = \eta u$, in order to emphasize that some of the LS estimators
considered here (namely, $\hat\beta_r(\lambda)$ and $\hat\beta_m$) depend on $\Sigma$, while
others ($\hat\beta_{ols}$ and $\hat\beta_{js}(\lambda)$) do not; this distinction becomes less
apparent if one assumes that $\Sigma = I$.

More fundamentally, Proposition 1 implies that for LS estimators and out-of-sample
prediction, sparsity is irrelevant. Indeed, the risk of an LS estimator is completely determined
by the signal-to-noise ratio $\eta^2 = \beta^T\Sigma\beta/\sigma^2$, which does not capture
well-accepted notions of sparsity (e.g. $\ell^p$-sparsity, $0 < p < 2$). Thus, LS estimators are
robust to sparsity assumptions, which may be desirable in situations where little is known about
sparsity. On the other hand, LS estimators are not able to take advantage of sparsity in
situations where $\beta$ is in fact sparse.
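
As a worked check of Definition 1, the ridge estimator can be verified to be LS directly. The sketch below reads the equivariance condition (7) as $A^{-1}\hat\beta(y, X, \Sigma) = \hat\beta(y, XA, A^T\Sigma A)$ and, as a simplifying assumption, takes $\lambda > 0$ so that $X^TX + n\lambda\Sigma$ is invertible and the pseudoinverse is an ordinary inverse:

```latex
\begin{align*}
\hat\beta_r(\lambda)\big|_{(y,\,XA,\,A^T\Sigma A)}
  &= \{A^TX^TXA + n\lambda A^T\Sigma A\}^{-1}A^TX^Ty \\
  &= \{A^T(X^TX + n\lambda\Sigma)A\}^{-1}A^TX^Ty \\
  &= A^{-1}(X^TX + n\lambda\Sigma)^{-1}X^Ty
   = A^{-1}\hat\beta_r(\lambda)\big|_{(y,\,X,\,\Sigma)}, \\[4pt]
\hat\beta_r(\lambda)\big|_{(ty,\,tX,\,t^2\Sigma)}
  &= (t^2X^TX + n\lambda t^2\Sigma)^{-1}\,t^2X^Ty
   = \hat\beta_r(\lambda)\big|_{(y,\,X,\,\Sigma)} .
\end{align*}
```

The same two-line argument applies to $\hat\beta_m = n^{-1}\Sigma^{-1}X^Ty$, since $(A^T\Sigma A)^{-1}A^TX^Ty = A^{-1}\Sigma^{-1}X^Ty$.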

5. Finite sample results 
5.1. The OLS estimator 

For $d < n - 1$, the predictive risk of the OLS estimator is well known. For $d > n - 1$, the
analysis is straightforward, but less widely available in the literature (recall that the
Moore-Penrose pseudoinverse is used to define $\hat\beta_{ols}$ when $d > n$).

Proposition 2.

$$R_V(\hat\beta_{ols}) = \begin{cases} \dfrac{d}{n-d-1} & \text{if } d < n - 1, \\[6pt] \dfrac{n}{d-n-1} + \eta^2\dfrac{d-n}{d} & \text{if } d > n + 1, \\[6pt] \infty & \text{if } d \in \{n-1, n, n+1\}. \end{cases}$$

□

Notice that the predictive risk of the OLS estimator is finite whenever
$d \notin \{n-1, n, n+1\}$. In particular, it is finite when $d > n + 1$; however, the OLS
estimator is biased when $d > n$.
5.2. The marginal regression estimator

To our knowledge, the predictive risk of the marginal regression estimator, $\hat\beta_m$, has
not been studied previously.

Proposition 3.

$$R_V(\hat\beta_m) = \frac{d}{n} + \eta^2\,\frac{d+1}{n}.$$

□



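5.3. The James-Stein estimator

Proposition 4 identifies the optimal James-Stein tuning parameter $\lambda_{js}^*$ and yields
the predictive risk of the oracle James-Stein estimator
$\hat\beta_{js}^* = \hat\beta_{js}(\lambda_{js}^*)$. The result follows from a bias-variance
decomposition.

Proposition 4. Let

$$\lambda_{js}^* = \begin{cases} \dfrac{d}{\eta^2(n-d-1)} & \text{if } d < n - 1, \\[6pt] \dfrac{d}{\eta^2(d-n-1)} & \text{if } d > n + 1, \\[6pt] \infty & \text{if } d \in \{n-1, n, n+1\}, \end{cases}$$

and let $\hat\beta_{js}^* = \hat\beta_{js}(\lambda_{js}^*)$. Then

$$R_V\{\hat\beta_{js}(\lambda)\} = \begin{cases} \left(\dfrac{\lambda}{1+\lambda}\right)^2\eta^2 + \left(\dfrac{1}{1+\lambda}\right)^2\dfrac{d}{n-d-1} & \text{if } d < n - 1, \\[6pt] \left(\dfrac{1}{1+\lambda}\right)^2\dfrac{n}{d-n-1} + \left(\dfrac{\lambda}{1+\lambda}\right)^2\eta^2\dfrac{n}{d} + \eta^2\dfrac{d-n}{d} & \text{if } d > n + 1, \\[6pt] \infty & \text{if } d \in \{n-1, n, n+1\} \text{ and } \lambda < \infty, \end{cases}$$

and

$$\inf_{\lambda \in [0, \infty]} R_V\{\hat\beta_{js}(\lambda)\} = R_V(\hat\beta_{js}^*) = \begin{cases} \dfrac{\eta^2 d}{d + \eta^2(n-d-1)} & \text{if } d < n - 1, \\[6pt] \dfrac{\eta^2 n}{d + \eta^2(d-n-1)} + \eta^2\dfrac{d-n}{d} & \text{if } d > n + 1, \\[6pt] \eta^2 & \text{if } d \in \{n-1, n, n+1\}. \end{cases}$$

□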
5.4. Ridge regression

The optimal ridge parameter $\lambda_r^*$ and the predictive risk of the oracle ridge estimator
$\hat\beta_r^*$ are identified in Proposition 5. We have been unable to find a closed form
expression for the predictive risk of the oracle ridge estimator; such an expression may not
exist.

Proposition 5. Let $\lambda_r^* = d/(n\eta^2)$ and let $\hat\beta_r^* = \hat\beta_r(\lambda_r^*)$.
Then

$$R_V\{\hat\beta_r(\lambda)\} = E_I\,\mathrm{tr}\left[(X^TX + n\lambda I)^{-2}\left(X^TX + \frac{\eta^2}{d}n^2\lambda^2 I\right)\right] \tag{9}$$

and

$$R_V(\hat\beta_r^*) = \inf_{\lambda \in [0, \infty]} R_V\{\hat\beta_r(\lambda)\} = E_I\,\mathrm{tr}(X^TX + n\lambda_r^* I)^{-1}. \tag{10}$$

□
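
The passage from (9) to (10) can be seen eigenvalue by eigenvalue: for a fixed realization of $X^TX$ with eigenvalues $s_1, \dots, s_d$, the trace in (9) is a sum of terms $\{s_i + (\eta^2/d)(n\lambda)^2\}/(s_i + n\lambda)^2$, each of which is minimized at $n\lambda = d/\eta^2$. The following is a small sketch of that observation, reading (9) as a trace of the form just described; the eigenvalues and the function names are hypothetical and chosen only for illustration.

```python
def ridge_trace(eigs, n, lam, eta2):
    """tr[(W + n*lam*I)^{-2} (W + (eta2/d) * (n*lam)^2 * I)] for W with eigenvalues `eigs`."""
    d = len(eigs)
    c = n * lam
    return sum((s + (eta2 / d) * c * c) / (s + c) ** 2 for s in eigs)

def oracle_lambda(d, n, eta2):
    """Optimal ridge parameter lambda_r^* = d / (n * eta^2) from Proposition 5."""
    return d / (n * eta2)

eigs = [0.8, 3.2, 7.5, 12.1, 20.4]      # hypothetical eigenvalues of X^T X
n, eta2 = 10, 2.0
lam_star = oracle_lambda(len(eigs), n, eta2)

at_star = ridge_trace(eigs, n, lam_star, eta2)
# At lambda_r^*, the expression inside (9) collapses to the one inside (10).
trace_form = sum(1.0 / (s + n * lam_star) for s in eigs)

# Moving lambda away from lambda_r^* in either direction cannot decrease the trace.
grid = [lam_star * f for f in (0.1, 0.5, 0.9, 1.1, 2.0, 10.0)]
others = [ridge_trace(eigs, n, lam, eta2) for lam in grid]
print(at_star, trace_form, min(others))
```

Because the minimizer does not depend on the eigenvalues, it also minimizes the expectation in (9), which is consistent with the infimum stated in (10).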



5.5. A comparative analysis, part I: Oracle estimators and finite sample 
predictive risk 

Propositions 2, 4, and 5 immediately imply that
$R_V(\hat\beta_{js}^*), R_V(\hat\beta_r^*) \le R_V(\hat\beta_{ols})$. The predictive risks of
$\hat\beta_{ols}$, $\hat\beta_m$, $\hat\beta_{js}^*$, and $\hat\beta_r^*$ are all increasing in
$\eta^2$. Furthermore, if $d \notin \{n-1, n, n+1\}$, then

$$\lim_{\eta^2 \to \infty}\frac{R_V(\hat\beta_{js}^*)}{R_V(\hat\beta_{ols})} = \lim_{\eta^2 \to \infty}\frac{R_V(\hat\beta_r^*)}{R_V(\hat\beta_{ols})} = 1. \tag{11}$$

On the other hand, if $\eta^2 = 0$, then

$$R_V(\hat\beta_{js}^*) = R_V(\hat\beta_r^*) = 0 < R_V(\hat\beta_{ols}).$$

These observations may be summarized as follows: (i) shrinkage estimators offer improvements
over the OLS estimator in terms of out-of-sample prediction and (ii) these improvements are
most substantial when the signal-to-noise ratio $\eta^2$ is small and diminish as $\eta^2$ grows
larger. Properties like these are common among shrinkage estimators in many contexts.

At this stage, it appears to be difficult to make a detailed comparison between the predictive 
risk of the James-Stein estimator and that of the ridge estimator. However, we have the 
following result. 

Proposition 6. For any $\beta \in \mathbb{R}^d$,

$$R_V(\hat\beta_r^*) \le R_V(\hat\beta_{js}^*) \le R_V(\hat\beta_{ols}).$$

The inequality on the left is strict unless $\eta^2 = 0$. □

L. Dicker/Linear Estimators and High-Dimensional Linear Models

Proposition 6 is a consequence of Jensen's inequality and implies that the oracle ridge
estimator $\hat\beta_r^*$ has smaller predictive risk than the oracle James-Stein estimator
$\hat\beta_{js}^*$. This is, perhaps, not surprising, given that the ridge estimator utilizes
knowledge of the predictor covariance $\Sigma$ while the James-Stein estimator
$\hat\beta_{js}^*$ does not. On the other hand, the marginal estimator $\hat\beta_m$ also
utilizes knowledge of the predictor covariance, but the next proposition suggests that it is
less suitable for out-of-sample prediction.

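Proposition 7. Let $\hat\beta_{null} = 0$.
(a) Suppose that $d < n - 1$. Then

$$R_V(\hat\beta_m) \le R_V(\hat\beta_{ols}) \quad \text{if and only if} \quad \eta^2 \le \frac{d}{n-d-1} \quad \text{if and only if} \quad R_V(\hat\beta_{null}) \le R_V(\hat\beta_m).$$

(b) Suppose that $d \ge n - 1$. Then $R_V(\hat\beta_{null}) \le R_V(\hat\beta_m)$.

□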



The proof of Proposition 7 is straightforward and is omitted. Proposition 7 implies that if
the marginal estimator has smaller predictive risk than the OLS estimator, then the marginal
estimator itself is outperformed by the trivial estimator $\hat\beta_{null} = 0$. Additional
drawbacks of the marginal estimator include that for any fixed $d, n$,
$\lim_{\eta^2 \to \infty} R_V(\hat\beta_m) = \infty$. On the other hand, if $d < n - 1$,
$\lim_{\eta^2 \to \infty} R_V(\hat\beta_{ols}) = \lim_{\eta^2 \to \infty} R_V(\hat\beta_{js}^*) = \lim_{\eta^2 \to \infty} R_V(\hat\beta_r^*) = d/(n-d-1)$
(if $d > n$, then the limiting risk of these estimators is infinite as well; this is related to
non-estimability issues that are discussed further in Section 6.2.3).

A direct corollary of Proposition 7 is that $\hat\beta_m$ is dominated by the estimator

$$\hat\beta_{dom} = \begin{cases} \hat\beta_{null} & \text{if } \eta^2 \le \dfrac{d}{n-d-1} \text{ or } d \ge n - 1, \\[6pt] \hat\beta_{ols} & \text{otherwise}, \end{cases} \tag{12}$$

in the sense that $R_V(\hat\beta_{dom}) \le R_V(\hat\beta_m)$ with strict inequality whenever
$\eta^2 \ne d/(n-d-1)$.
Since 

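with strict inequality unless $\eta^2 = 0$, it follows that
$R_V(\hat\beta_{js}^*) \le R_V(\hat\beta_{dom})$ with equality if and only if $\eta^2 = 0$.
Thus, we recover (6):

$$R_V(\hat\beta_r^*) \le R_V(\hat\beta_{js}^*) \le \left\{\begin{array}{l} R_V(\hat\beta_{ols}) \\ R_V(\hat\beta_m). \end{array}\right.$$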
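
The finite sample comparisons in this section involve only elementary formulas, so they are easy to tabulate. The sketch below encodes the risks from Propositions 2-4 as they are read here (an assumption, since the ridge risk has no closed form it is omitted) and checks the right-hand part of (6) and the equivalence in Proposition 7(a) on a few hypothetical values of $(d, n, \eta^2)$; all function names are illustrative.

```python
import math

def risk_ols(d, n, eta2):
    if d < n - 1:
        return d / (n - d - 1)
    if d > n + 1:
        return n / (d - n - 1) + eta2 * (d - n) / d
    return math.inf

def risk_marginal(d, n, eta2):
    return d / n + eta2 * (d + 1) / n

def risk_js(d, n, eta2, lam):
    """Risk of the James-Stein estimator with fixed shrinkage lam >= 0."""
    s = 1.0 / (1.0 + lam)                      # shrinkage factor applied to OLS
    if d < n - 1:
        return s * s * d / (n - d - 1) + (1 - s) ** 2 * eta2
    if d > n + 1:
        return (s * s * n / (d - n - 1) + (1 - s) ** 2 * eta2 * n / d
                + eta2 * (d - n) / d)
    return math.inf

def lambda_js_oracle(d, n, eta2):
    if d < n - 1:
        return d / (eta2 * (n - d - 1))
    if d > n + 1:
        return d / (eta2 * (d - n - 1))
    return math.inf

def risk_js_oracle(d, n, eta2):
    if d < n - 1:
        return eta2 * d / (d + eta2 * (n - d - 1))
    if d > n + 1:
        return eta2 * n / (d + eta2 * (d - n - 1)) + eta2 * (d - n) / d
    return eta2

cases = [(5, 20, 3.0), (5, 20, 0.1), (50, 20, 3.0)]   # hypothetical (d, n, eta^2)
for d, n, eta2 in cases:
    print(d, n, eta2,
          round(risk_js_oracle(d, n, eta2), 4),
          round(risk_ols(d, n, eta2), 4),
          round(risk_marginal(d, n, eta2), 4))
```

In each row the oracle James-Stein risk is the smallest of the three printed risks, and the marginal risk grows linearly in $\eta^2$ while the others stay bounded when $d < n - 1$.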
6. Asymptotic results for the oracle estimators

The rest of this paper is primarily concerned with asymptotic results, which can be divided 
into two categories: results about the asymptotic predictive risk of oracle estimators and 
results about how well adaptive estimators approximate the oracle estimators. Asymptotic 
properties of oracle estimators are studied in this section. 



6.1. Ridge regression 



To obtain a formula for the asymptotic predictive risk of $\hat\beta_r^*$, we rely on classical
results from random matrix theory. For $\rho \in (0, \infty)$ the Marcenko-Pastur density
$f_\rho$ is defined by

$$f_\rho(z) = \frac{dF_\rho}{dz}(z) = \max\{1 - \rho^{-1}, 0\}\,\delta_0(z) + \frac{1}{2\pi\rho z}\sqrt{4\rho - (z - \rho - 1)^2}, \quad a \le z \le b,$$

where $a = (1 - \sqrt{\rho})^2$, $b = (1 + \sqrt{\rho})^2$, and $\delta_0(x) = 1$ or $0$ according
to whether $x = 0$ or $x \ne 0$. The density $f_\rho$ determines the Marcenko-Pastur
distribution, $F_\rho$, which is the limiting distribution of the eigenvalues of $n^{-1}X^TX$, if
$\Sigma = I$, $n \to \infty$, and $d/n \to \rho \in (0, \infty)$ (Marcenko and Pastur, 1967). The
Stieltjes transform of the Marcenko-Pastur distribution,

$$m_\rho(s) = \int\frac{1}{z - s}\,dF_\rho(z) = -\frac{1}{2\rho s}\left\{s + \rho - 1 + \sqrt{(s + \rho - 1)^2 - 4\rho s}\right\}, \quad s < 0, \tag{13}$$

(Bai, 1993) has played a prominent role in the discovery and subsequent analysis of the
Marcenko-Pastur distribution; see, for instance, (Bai, 1993), (Silverstein, 1995), and
(El Karoui, 2008b). The main result of this section implies that the risk of the oracle ridge
estimator $R_V(\hat\beta_r^*)$ may be approximated by $(d/n)\,m_{d/n}(-\lambda_r^*)$, where
$\lambda_r^*$ is the optimal ridge parameter defined in Proposition 5.
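
The closed form (13) can be sanity-checked against a direct numerical integration of the Marcenko-Pastur density. The following sketch assumes the density and transform as written above, namely a continuous part $\sqrt{4\rho - (z-\rho-1)^2}/(2\pi\rho z)$ on $[a, b]$ plus a point mass $\max\{1 - \rho^{-1}, 0\}$ at zero; the function names, the midpoint rule, and the step count are illustrative choices, and `ridge_risk_approx` simply evaluates the approximation $(d/n)\,m_{d/n}(-\lambda_r^*)$ with $\lambda_r^* = d/(n\eta^2)$.

```python
import math

def mp_stieltjes(rho, s):
    """Closed-form Stieltjes transform of the Marcenko-Pastur law for s < 0."""
    t = s + rho - 1.0
    return -(t + math.sqrt(t * t - 4.0 * rho * s)) / (2.0 * rho * s)

def mp_stieltjes_numeric(rho, s, steps=20000):
    """Midpoint-rule integral of density/(z - s) over [a, b], plus the atom at zero."""
    a = (1.0 - math.sqrt(rho)) ** 2
    b = (1.0 + math.sqrt(rho)) ** 2
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        z = a + (k + 0.5) * h
        density = math.sqrt(max((b - z) * (z - a), 0.0)) / (2.0 * math.pi * rho * z)
        total += density / (z - s) * h
    atom = max(1.0 - 1.0 / rho, 0.0)          # point mass at z = 0 when rho > 1
    return total + atom / (0.0 - s)

def ridge_risk_approx(d, n, eta2):
    """Approximation (d/n) * m_{d/n}(-lambda_r^*) with lambda_r^* = d / (n * eta^2)."""
    rho = d / n
    return rho * mp_stieltjes(rho, -rho / eta2)

for rho in (0.5, 2.0):
    print(rho, mp_stieltjes(rho, -1.0), mp_stieltjes_numeric(rho, -1.0))
print(ridge_risk_approx(50, 100, 2.0))
```

The case $\rho = 1$ is avoided in the numerical check because the density is unbounded near $z = 0$ there, which a plain midpoint rule handles poorly.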

Proposition 8. Suppose that $0 < \theta \le d/n \le \Theta < \infty$ for some fixed constants
$\theta, \Theta \in \mathbb{R}$.
(a) If $0 < \theta \le \Theta < 1$ or $1 < \theta \le \Theta < \infty$ and $|n - d| > 5$, then



^ * d 

Rvi^r) md/n{-K) 



n 



O 



(b) If $0 < \theta \le 1 \le \Theta < \infty$, then



Rvi^r) rUd/ni-K) 

Th 
□



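There are two keys to the proof of Proposition 8. The first is the observation that

$$R_V(\hat\beta_r^*) = \frac{1}{n}E_I\,\mathrm{tr}(n^{-1}X^TX + \lambda_r^* I)^{-1} = \frac{d}{n}E_I\int\frac{1}{s + \lambda_r^*}\,d\mathbb{F}_{n,d}(s),$$

where $\mathbb{F}_{n,d}$ is the empirical cumulative distribution function of the eigenvalues of
$n^{-1}X^TX$; in other words, the risk of the oracle ridge estimator is $d/n$ times the expected
value of the Stieltjes transform of $\mathbb{F}_{n,d}$, evaluated at $-\lambda_r^*$. The second
key is Theorem 3.1 of Bai (1993) which states that

$$\sup_s\left|E\,\mathbb{F}_{n,d}(s) - F_{d/n}(s)\right| = \begin{cases} O(n^{-1/4}) & \text{if } 0 < \theta \le \Theta < 1 \text{ or } 1 < \theta \le \Theta < \infty, \\ O(n^{-5/48}) & \text{if } 0 < \theta \le 1 \le \Theta < \infty. \end{cases} \tag{14}$$

The different rates in (14) for settings where $0 < \theta \le \Theta < 1$ or
$1 < \theta \le \Theta < \infty$ and $0 < \theta \le 1 \le \Theta < \infty$ help to explain why
these situations are considered separately in Proposition 8.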

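Corollary 1. Define the asymptotic predictive risk of the oracle ridge estimator,

$$
R_r(\rho, \eta^2) = \rho\, m_\rho(-\rho/\eta^2).
$$

Then

$$
\lim_{d/n \to \rho}\ \sup_{V \in PD(d+1)} \left| R_V(\hat{\beta}_r^*) - R_r(d/n, \eta^2) \right| = 0, \tag{15}
$$

provided $\rho \in (0, \infty) \setminus \{1\}$. □

Remark. The limit in (15) is taken as $n \to \infty$ and $d/n \to \rho$.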
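As a numerical illustration of Corollary 1, the sketch below evaluates $\rho\, m_\rho(-\rho/\eta^2)$ using (13) and compares it with a Monte Carlo estimate of $E_I \operatorname{tr}(X^TX + n\lambda_r^* I)^{-1}$, where $\lambda_r^* = d/(n\eta^2)$. It assumes $\Sigma = I$ and Gaussian predictors; the function names, sample sizes, and number of replications are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_asymptotic_risk(rho, eta2):
    """rho * m_rho(-rho / eta^2), using the closed form (13); needs rho, eta2 > 0."""
    s = -rho / eta2
    root = np.sqrt((s + rho - 1.0) ** 2 - 4.0 * rho * s)
    return -(s + rho - 1.0 + root) / (2.0 * s)

def oracle_ridge_risk_mc(n, d, eta2, reps, rng):
    """Monte Carlo estimate of E tr(X^T X + n*lam*I)^{-1} with lam = d / (n * eta2)."""
    lam = d / (n * eta2)
    draws = []
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        eigenvalues = np.linalg.eigvalsh(X.T @ X / n)
        # tr(X^T X + n*lam*I)^{-1} = (1/n) * sum_j 1 / (s_j + lam)
        draws.append(np.sum(1.0 / (eigenvalues + lam)) / n)
    return float(np.mean(draws))

rng = np.random.default_rng(0)
n, d, eta2 = 400, 200, 2.0
print("asymptotic:", ridge_asymptotic_risk(d / n, eta2))
print("simulated: ", oracle_ridge_risk_mc(n, d, eta2, 20, rng))
```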



6.2. A comparative analysis, part II: Asymptotic predictive risk
6.2.1. $d/n \to 0$

Propositions 2-5 imply $R_V(\hat{\beta}_{ols}), R_V(\hat{\beta}_{js}^*), R_V(\hat{\beta}_r^*) = O(d/n)$ and $R_V(\hat{\beta}_m) = O\{(1 + \eta^2)d/n\}$.
It follows that if $d/n \to 0$, then the estimators are consistent. Additionally, we have
the following result.

Proposition 9. If $d/n \to 0$ and $d/(n\eta^2) \to c \in [0, \infty]$, then

$$
\frac{R_V(\hat{\beta}_{ols})}{R_V(\hat{\beta}_{js}^*)} \to 1 + c \qquad \text{and} \qquad \frac{R_V(\hat{\beta}_{js}^*)}{R_V(\hat{\beta}_r^*)} \to 1.
$$
□

Thus, if $d/n \to 0$ and the signal-to-noise ratio is large (i.e. $d/(n\eta^2) \to 0$), then the OLS,
James-Stein, and ridge estimators are all asymptotically equivalent; if $d/n \to 0$ and the
signal-to-noise ratio is small ($d/(n\eta^2) \to c > 0$), then the James-Stein and ridge estimators
are asymptotically equivalent and they both outperform the OLS estimator asymptotically.



6.2.2. $d/n \to \rho \in (0, \infty)$

If $d/n \to \rho \in (0, \infty)$, then Corollary 1 implies that the predictive risk of $\hat{\beta}_r^*$ is non-vanishing.
This is also true of the OLS, James-Stein, and marginal estimators. In fact, it is straightforward
to derive the asymptotic predictive risk of these estimators. (All of the limits below are
valid for fixed signal-to-noise ratios $\eta^2$; the convergence in fact holds with varying degrees of
uniformity in $\eta^2$ for the different estimators, but these details are not critical for our
analysis.)



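$$
\text{OLS:} \quad R_{ols}(\rho, \eta^2) = \lim_{d/n \to \rho} R_V(\hat{\beta}_{ols}) =
\begin{cases}
\dfrac{\rho}{1 - \rho} & \text{if } \rho < 1, \\[1ex]
\infty & \text{if } \rho = 1, \\[1ex]
\eta^2\,\dfrac{\rho - 1}{\rho} + \dfrac{1}{\rho - 1} & \text{if } \rho > 1,
\end{cases}
$$

$$
\text{James-Stein:} \quad R_{js}(\rho, \eta^2) = \lim_{d/n \to \rho} R_V(\hat{\beta}_{js}^*) =
\begin{cases}
\dfrac{\eta^2 \rho}{\eta^2(1 - \rho) + \rho} & \text{if } \rho \le 1, \\[1ex]
\dfrac{\eta^2}{\eta^2(\rho - 1) + \rho} + \eta^2\,\dfrac{\rho - 1}{\rho} & \text{if } \rho > 1,
\end{cases}
$$

$$
\text{Marginal:} \quad R_m(\rho, \eta^2) = \lim_{d/n \to \rho} R_V(\hat{\beta}_m) = (\eta^2 + 1)\rho.
$$

It is easy to check that

$$
R_r(\rho, \eta^2) \le R_{js}(\rho, \eta^2) \le \min\left\{ R_{ols}(\rho, \eta^2),\ R_m(\rho, \eta^2) \right\}, \tag{16}
$$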



and that the inequality on the left is strict unless $\eta^2 = 0$. The inequalities (16) indicate
that the advantages of the ridge estimator (over the James-Stein, marginal, and OLS estimators)
and the James-Stein estimator (over the marginal and OLS estimators) persist in
high-dimensional datasets, as $n \to \infty$ and $d/n \to \rho \in (0, \infty)$. More fundamentally, they
illustrate that different linear estimators may have significantly different out-of-sample
prediction properties in high-dimensional data analysis: differences between the estimators do
not "wash out" in this asymptotic setting. In addition to providing this qualitative information,
these asymptotic formulas provide an analytic tool for studying the various estimators'
predictive risk. Figures 1-3 contain several plots of asymptotic predictive risk for the OLS,
marginal, oracle James-Stein, and oracle ridge estimators. We point out that Figures 2-3 contain
plots of the asymptotic predictive risk for the oracle marginal shrinkage estimator, which
is introduced in Section 8.3 (Proposition 13).

Fig 1. (a) The asymptotic predictive risk of the OLS ($\hat{\beta}_{ols}$) and marginal ($\hat{\beta}_m$) estimator versus $\rho$ for multiple
values of $\eta^2$. The asymptotic predictive risk is increasing with $\eta^2$. (b) The asymptotic predictive risk of the
marginal estimator and the dominating estimator ($\hat{\beta}_{dom}$, defined in (12)) versus $\rho$, for $\eta^2 = 2.5$. One easily
checks that $\lim_{d/n \to \rho} R_V(\hat{\beta}_{dom}) = \rho/(1 - \rho)$ or $\eta^2$ according to whether $\rho/(1 - \rho) < \eta^2$ or $\rho/(1 - \rho) > \eta^2$.

Figure 2 depicts the singularities in $R_{ols}(\rho, \eta^2)$ and $R_{js}(\rho, \eta^2)$ at $\rho = 1$. Notice that the ridge
estimator's advantage over the other estimators appears to be most pronounced at $\rho = 1$. This
is borne out by the fact that as $\eta^2 \to \infty$, $R_r(1, \eta^2) \asymp \eta$, but $R_{js}(1, \eta^2) = \eta^2$. On the other
hand, if $\rho \ne 1$, then $R_r(\rho, \eta^2)/R_{js}(\rho, \eta^2) \to 1$ as $\eta^2 \to \infty$.
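The ordering of the asymptotic risks can be explored numerically. The sketch below assumes the limiting formulas implied by Propositions 2-5, namely $\rho/(1-\rho)$ for OLS when $\rho < 1$, $\eta^2(\rho-1)/\rho + 1/(\rho-1)$ for OLS when $\rho > 1$, $(\eta^2+1)\rho$ for the marginal estimator, the oracle James-Stein limit obtained by optimizing the shrinkage factor, and $\rho\, m_\rho(-\rho/\eta^2)$ for the oracle ridge estimator. The function names and the grid of $(\rho, \eta^2)$ values are illustrative assumptions.

```python
import numpy as np

def risk_ols(rho, eta2):
    if rho < 1:
        return rho / (1 - rho)
    if rho == 1:
        return np.inf
    return eta2 * (rho - 1) / rho + 1 / (rho - 1)

def risk_js(rho, eta2):
    # Optimally shrunk part plus the part of the signal outside the row space of X.
    shrunk = eta2 * min(rho, 1.0) / (eta2 * abs(1 - rho) + rho)
    outside = eta2 * max(rho - 1, 0.0) / rho
    return shrunk + outside

def risk_marginal(rho, eta2):
    return (eta2 + 1) * rho

def risk_ridge(rho, eta2):
    # rho * m_rho(-rho / eta2), valid for eta2 > 0.
    s = -rho / eta2
    root = np.sqrt((s + rho - 1.0) ** 2 - 4.0 * rho * s)
    return -(s + rho - 1.0 + root) / (2.0 * s)

for rho in (0.25, 0.75, 1.0, 2.0):
    for eta2 in (0.5, 1.0, 5.0, 10.0):
        row = (risk_ridge(rho, eta2), risk_js(rho, eta2),
               risk_ols(rho, eta2), risk_marginal(rho, eta2))
        print(f"rho={rho:4.2f} eta2={eta2:5.1f}  ridge={row[0]:.3f}  js={row[1]:.3f}  "
              f"ols={row[2]:.3f}  marginal={row[3]:.3f}")
```

At $\rho = 1$ the ridge value equals $(\sqrt{1 + 4\eta^2} - 1)/2$, which grows like $\eta$, while the James-Stein value equals $\eta^2$.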

6.2.3. $d/n \to \infty$

Figure 2 also suggests that, for fixed $\eta^2$, each of the depicted estimators' asymptotic predictive
risk approaches the same finite limit as $\rho \to \infty$. One can easily check that this limit is $\eta^2$.
This is reflective of the estimators' behavior as $d/n \to \infty$. Indeed, it follows directly from
Propositions 2-5 that

$$
\lim_{d/n \to \infty} R_V(\hat{\beta}_{ols}) = \lim_{d/n \to \infty} R_V(\hat{\beta}_{js}^*) = \lim_{d/n \to \infty} R_V(\hat{\beta}_r^*) = \lim_{d/n \to \infty} R_V(\hat{\beta}_m^*) = \eta^2
$$

($\hat{\beta}_m^*$ is the marginal shrinkage estimator defined in Proposition 13; for the marginal estimator,
$R_V(\hat{\beta}_m) \to \infty$ as $\rho \to \infty$). In fact, it can be shown that if $d > n$, then the predictive
risk of any LS estimator must be at least $\eta^2(d - n)/d$. Thus, if $\hat{\beta}$ is an LS estimator, then
$\liminf_{d/n \to \infty} R_V(\hat{\beta})/R_V(\hat{\beta}_{null}) \ge 1$. This is discussed further in (Dicker, 2012), where it is
shown that if $d/n \to \infty$, then $\hat{\beta}_{null}$ is asymptotically minimax over $\ell^2$-balls.

Fig 2. Asymptotic predictive risk versus $\rho$ for the oracle ridge estimator ($\hat{\beta}_r^*$), the oracle James-Stein estimator
($\hat{\beta}_{js}^*$), the oracle marginal shrinkage estimator ($\hat{\beta}_m^*$, defined in Proposition 13), and the OLS ($\hat{\beta}_{ols}$) estimator
for various values of $\eta^2$: (a) $\eta^2 = 1$, (b) $\eta^2 = 5$, and (c) $\eta^2 = 10$.

7. Adaptive estimators 

In the previous sections, we studied the predictive risk of $\hat{\beta}_r^*$ and $\hat{\beta}_{js}^*$. This analysis provides
substantial insight into the performance of ridge and James-Stein estimators. However, even
assuming that $\Sigma$ is known, the estimators $\hat{\beta}_r^*$ and $\hat{\beta}_{js}^*$ are usually not implementable, since
they depend on the signal-to-noise ratio $\eta^2 = \beta^T\Sigma\beta/\sigma^2$, which is usually unknown. In this
section, we show that if $0 < \inf d/n < \sup d/n < 1$, then $\eta^2$ may be effectively estimated. More
specifically, we show that the predictive risk of adaptive ridge and James-Stein estimators that
utilize an estimate of $\eta^2$ (and, subsequently, an estimate of the optimal shrinkage parameter)
is very close to that of the corresponding oracle estimators.

Fig 3. Asymptotic predictive risk versus the signal-to-noise ratio, $\eta^2$, for the oracle ridge estimator ($\hat{\beta}_r^*$), the
oracle James-Stein estimator ($\hat{\beta}_{js}^*$), the oracle marginal shrinkage estimator ($\hat{\beta}_m^*$, defined in Proposition 13),
and the OLS ($\hat{\beta}_{ols}$) estimator for various values of $\rho$: (a) $\rho = 0.25$, (b) $\rho = 0.75$, and (c) $\rho = 2$.



7.1. Estimating the signal-to-noise ratio 

For $d < n$, define the estimator

$$
\hat{\eta}^2 = \max\left\{ \frac{\|y\|^2}{n\hat{\sigma}^2} - 1,\ 0 \right\} = \max\left\{ \frac{\|X\hat{\beta}_{ols}\|^2}{n\hat{\sigma}^2} - \frac{d}{n},\ 0 \right\}, \tag{17}
$$

where $\hat{\sigma}^2 = (n - d)^{-1}\|y - X\hat{\beta}_{ols}\|^2$. To motivate this definition, notice that

$$
\frac{1}{n}\|y\|^2 \approx \beta^T\Sigma\beta + \sigma^2.
$$

Results in Appendix A establish convergence rates for $E_V|\hat{\eta}^2 - \eta^2|^k$ and other technical results
that are important for proving Proposition 10 below. When $d \ge n$, $\hat{\sigma}^2$ is undefined and the
estimator $\hat{\eta}^2$ breaks down. We conjecture that it is possible to derive effective estimators of
$\eta^2$ when $d > n$, provided $\sup d/n < \infty$. This is discussed further in Section 9.1; however, it is
not pursued at length in this paper.
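The estimator (17) is straightforward to compute from an OLS fit. The sketch below is a minimal illustration, assuming $d < n$, $\Sigma = I$, and Gaussian data; the function name `estimate_snr` and the simulated design are hypothetical and chosen only to show the calculation.

```python
import numpy as np

def estimate_snr(y, X):
    """Plug-in estimate of eta^2 = beta^T Sigma beta / sigma^2, following (17)."""
    n, d = X.shape
    if d >= n:
        raise ValueError("the estimator (17) requires d < n")
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_ols
    # Usual unbiased estimate of the residual variance.
    sigma2_hat = np.sum((y - fitted) ** 2) / (n - d)
    eta2_hat = max(np.sum(fitted ** 2) / (n * sigma2_hat) - d / n, 0.0)
    return eta2_hat, sigma2_hat

rng = np.random.default_rng(0)
n, d, sigma = 2000, 200, 1.0
beta = np.full(d, 2.0 / np.sqrt(d))   # ||beta||^2 = 4, so eta^2 = 4 when Sigma = I
X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)
eta2_hat, sigma2_hat = estimate_snr(y, X)
print(f"eta2_hat={eta2_hat:.3f}  sigma2_hat={sigma2_hat:.3f}")
```

Because $\|y\|^2 = \|X\hat{\beta}_{ols}\|^2 + (n - d)\hat{\sigma}^2$, the two expressions inside the maxima in (17) agree exactly, which the code relies on by using only the second form.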



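7.2. The adaptive ridge and James-Stein estimators

Recall from Propositions 4 and 5 that the oracle ridge and James-Stein estimators are $\hat{\beta}_r^* = \hat{\beta}_r(\lambda_r^*)$
and $\hat{\beta}_{js}^* = \hat{\beta}_{js}(\lambda_{js}^*)$, respectively, where $\lambda_r^* = d/(n\eta^2)$ and $\lambda_{js}^* = d/\{\eta^2(n - d - 1)\}$.
The next proposition is the main result in this section.

Proposition 10. Suppose that $0 < \theta < d/n < \Theta < 1$ for some fixed constants $\theta, \Theta \in \mathbb{R}$ and
that $n - d > 9$.

(a) Define the adaptive ridge estimator $\check{\beta}_r^* = \hat{\beta}_r(\check{\lambda}_r^*)$, where $\check{\lambda}_r^* = d/(n\hat{\eta}^2)$. Then

$$
R_V(\check{\beta}_r^*) = R_V(\hat{\beta}_r^*) + O\left\{ \frac{1}{n^{1/2}(\eta^2 + 1)} \right\}.
$$

(b) Define the adaptive James-Stein estimator $\check{\beta}_{js}^* = \hat{\beta}_{js}(\check{\lambda}_{js}^*)$, where $\check{\lambda}_{js}^* = d/\{\hat{\eta}^2(n - d - 1)\}$.
Then

$$
R_V(\check{\beta}_{js}^*) = R_V(\hat{\beta}_{js}^*) + O\left\{ \frac{1}{n^{1/2}(\eta^2 + 1)} \right\}.
$$

□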



Remark 1. When $\hat{\eta} = 0$, we follow the convention that $\check{\beta}_r^* = \check{\beta}_{js}^* = 0$. Notice that both $\check{\beta}_r^*$
and $\check{\beta}_{js}^*$ are LS.

Remark 2. Proposition 10 implies that the predictive risk of the adaptive ridge and James-Stein
estimators converges to the predictive risk of the corresponding oracle estimators uniformly
for $V \in PD(d + 1)$. Moreover, if $\eta^2 \gg n^{-1/2}$, then Proposition 10 implies that

$$
\frac{R_V(\check{\beta}_r^*)}{R_V(\hat{\beta}_r^*)},\ \frac{R_V(\check{\beta}_{js}^*)}{R_V(\hat{\beta}_{js}^*)} \to 1.
$$

On the other hand, if $\eta^2 = O(n^{-1/2})$, then Proposition 10 is less useful for comparing the
performance of $\check{\beta}_r^*$ and $\check{\beta}_{js}^*$ to the oracle estimators. Indeed, if $\eta^2 = O(n^{-1/2})$, then
$R_V(\check{\beta}_{js}^*), R_V(\check{\beta}_r^*), R_V(\hat{\beta}_{js}^*), R_V(\hat{\beta}_r^*), R_V(\hat{\beta}_{null}) = O(n^{-1/2})$ (recall that $\hat{\beta}_{null} = 0$) and the
benefits of $\check{\beta}_r^*$ and $\check{\beta}_{js}^*$ over even $\hat{\beta}_{null}$ are unclear; however, in this setting, the estimators
are consistent and $R_V(\hat{\beta}_{null})/R_V(\hat{\beta}_r^*),\ R_V(\hat{\beta}_{null})/R_V(\hat{\beta}_{js}^*) = O(1)$, so that even the oracle ridge
and James-Stein estimators are not dramatic improvements on $\hat{\beta}_{null}$.

Remark 3. The condition $0 < \theta < d/n$ in Proposition 10 can be removed. However, the
corresponding error terms in parts (a) and (b) are more complicated; a precise statement
is omitted from this paper. Ultimately, however, when $d/n \to 0$ the message remains the
same: if $\eta^2$ is not too small, then the adaptive estimators perform nearly as well as the oracle
estimators.

Proposition 10 is proved in Appendix B. 
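To make the adaptive estimators concrete, the sketch below plugs the estimate $\hat{\eta}^2$ from (17) into the oracle shrinkage parameters $d/(n\eta^2)$ and $d/\{\eta^2(n - d - 1)\}$ and compares the resulting average loss with that of OLS in a small simulation. It assumes $\Sigma = I$ is known, $\sigma^2 = 1$, Gaussian data, and $d < n - 1$; the function names and simulation settings are hypothetical, and the comparison is an illustration rather than a reproduction of any result in the paper.

```python
import numpy as np

def ols_and_snr(y, X):
    """Return the OLS estimate and the plug-in estimate of eta^2 from (17)."""
    n, d = X.shape
    if d >= n - 1:
        raise ValueError("this sketch assumes d < n - 1")
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_ols
    sigma2_hat = np.sum((y - fitted) ** 2) / (n - d)
    eta2_hat = max(np.sum(fitted ** 2) / (n * sigma2_hat) - d / n, 0.0)
    return beta_ols, eta2_hat

def adaptive_ridge(y, X, Sigma):
    """(X^T X + n * lam * Sigma)^{-1} X^T y with lam = d / (n * eta2_hat)."""
    n, d = X.shape
    _, eta2_hat = ols_and_snr(y, X)
    if eta2_hat == 0.0:          # convention: return the null estimator
        return np.zeros(d)
    lam = d / (n * eta2_hat)
    return np.linalg.solve(X.T @ X + n * lam * Sigma, X.T @ y)

def adaptive_james_stein(y, X):
    """OLS scaled by 1 / (1 + lam) with lam = d / {eta2_hat * (n - d - 1)}."""
    n, d = X.shape
    beta_ols, eta2_hat = ols_and_snr(y, X)
    if eta2_hat == 0.0:
        return np.zeros(d)
    lam = d / (eta2_hat * (n - d - 1))
    return beta_ols / (1.0 + lam)

rng = np.random.default_rng(0)
n, d, reps = 100, 50, 200
beta = np.full(d, 1.0 / np.sqrt(d))   # eta^2 = 1 when Sigma = I and sigma^2 = 1
losses = {"ols": [], "js": [], "ridge": []}
for _ in range(reps):
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)
    estimates = {
        "ols": np.linalg.lstsq(X, y, rcond=None)[0],
        "js": adaptive_james_stein(y, X),
        "ridge": adaptive_ridge(y, X, np.eye(d)),
    }
    for name, estimate in estimates.items():
        # With Sigma = I and sigma^2 = 1 the predictive loss is ||estimate - beta||^2.
        losses[name].append(np.sum((estimate - beta) ** 2))
mean_loss = {name: float(np.mean(values)) for name, values in losses.items()}
print(mean_loss)
```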

7.3. A comparative analysis, part III: Minimax estimators 



In addition to providing direct information about the performance of the adaptive James-Stein
and ridge estimators vis-a-vis the oracle estimators, Proposition 10 also helps to shed light on
the performance of the adaptive estimators relative to the OLS estimator and to each other.
Consider the following corollary to Proposition 10.

Corollary 2. Suppose that $0 < \theta < d/n < \Theta < 1$ for some fixed constants $\theta, \Theta \in \mathbb{R}$ and
let $\check{\beta}_r^*$ and $\check{\beta}_{js}^*$ be the adaptive ridge and James-Stein estimators defined in the statement of
Proposition 10. If $n$ is sufficiently large, then

$$
R_V(\check{\beta}_r^*),\ R_V(\check{\beta}_{js}^*) < R_V(\hat{\beta}_{ols}) \quad \text{for all } V \in PD(d + 1). \tag{18}
$$

In particular, if $n$ is sufficiently large, then the adaptive ridge and James-Stein estimators are
minimax over the entire parameter space in the sense that

$$
\sup_{V \in PD(d+1)} R_V(\check{\beta}_r^*) = \sup_{V \in PD(d+1)} R_V(\check{\beta}_{js}^*) = \inf_{\hat{\beta}} \sup_{V \in PD(d+1)} R_V(\hat{\beta}), \tag{19}
$$

where the infimum on the right-hand side of (19) is taken over all measurable estimators $\hat{\beta}$.
□

The first part of Corollary 2 (regarding the inequalities (18)) follows directly from Proposition 10
and two observations: (i) if $0 < \theta < d/n < \Theta < 1$ for constants $\theta, \Theta \in \mathbb{R}$, then there
exists a constant $c > 0$ such that

^ * c 

Rvif^ols) ~ RviPjs) > 

whenever $\eta^2$ and $n$ are sufficiently large, and (ii) $R_V(\hat{\beta}_r^*) \le R_V(\hat{\beta}_{js}^*)$. The minimaxity result
(19) follows directly from the first part of the corollary and standard arguments from decision
theory.

8. Additional topics 

8.1. The ridge estimator: Estimating the population covariance 

If $\Sigma$ is unknown, then even the adaptive ridge estimator cannot be implemented. On the
other hand, if $\hat{\Sigma}$ is an estimator of $\Sigma$, then it may be reasonable to use a ridge estimator of
the form

$$
\hat{\beta}_r(\check{\lambda}_r^*, \hat{\Sigma}) = (X^TX + n\check{\lambda}_r^*\hat{\Sigma})^{-1}X^Ty,
$$

where $\check{\lambda}_r^*$ is defined in Proposition 10. For a matrix $A$, let $\|A\|$ denote its operator norm.

Proposition 11. Suppose that < 6* < d/n < < 1 for some fixed constants 6*, e M and 
that n — d > 6. Then 

Ry{K) = RviK) + 0l^\\S-'\\ {Ey\\IJ - + O [^^^^^^] ■ 
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□ 

Proposition 11 implies that if the smallest eigenvalue of $\Sigma$ is bounded below by some positive
number and if $\hat{\Sigma}$ is operator norm-consistent in the sense that $E_V\|\hat{\Sigma} - \Sigma\|^2 \to 0$, then the
asymptotic predictive risk of $\hat{\beta}_r(\check{\lambda}_r^*, \hat{\Sigma})$ is the same as that of the optimal ridge estimator,
$\hat{\beta}_r^*$. It is worth pointing out that, under the asymptotic setting described in Proposition 11,
$n^{-1}X^TX$ is not operator norm-consistent for $\Sigma$; indeed, $\hat{\beta}_r(\check{\lambda}_r^*, n^{-1}X^TX)$ is the adaptive
James-Stein estimator and Proposition 10 implies that $\check{\beta}_{js}^*$ is not asymptotically equivalent
to the oracle ridge estimator. On the other hand, norm-consistent estimators for $\Sigma$ may be
available over wide classes of covariance matrices, subject to certain restrictions (Bickel and
Levina, 2008; Cai et al., 2010; El Karoui, 2008a).



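8.2. The James-Stein estimator: Baranchik's estimator

Baranchik (1973) studied the predictive risk of a James-Stein type estimator different from
our adaptive James-Stein estimator, $\check{\beta}_{js}^* = \hat{\beta}_{js}(\check{\lambda}_{js}^*)$. Baranchik proved that the estimator

$$
\hat{\beta}_{bar} = \hat{\beta}_{js}(\hat{\lambda}_{bar}) = (1 + \hat{\lambda}_{bar})^{-1}\hat{\beta}_{ols} = \left( 1 - \frac{c\,\|y - X\hat{\beta}_{ols}\|^2}{\|X\hat{\beta}_{ols}\|^2} \right)\hat{\beta}_{ols} \tag{20}
$$

has smaller predictive risk than the OLS estimator (thus, is minimax over $V \in PD(d + 1)$)
whenever $d > 3$ and $n - d > 2$, provided the constant $c$ satisfies $0 < c < 2(d - 2)/(n - d + 2)$.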

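It is informative to consider the asymptotic predictive risk of Baranchik's estimator. First
notice that (20) implies

$$
\hat{\lambda}_{bar} = \frac{c\,\|y - X\hat{\beta}_{ols}\|^2}{\|X\hat{\beta}_{ols}\|^2 - c\,\|y - X\hat{\beta}_{ols}\|^2}.
$$

The key observation is that if $d$, $n$ are large and $d/n$ is close to $\rho \in (0, 1)$, then

$$
\hat{\lambda}_{bar} \approx \lambda_{bar} = \frac{c(1 - \rho)}{\eta^2 + \rho - c(1 - \rho)}
$$

and, furthermore, $\lambda_{bar}$ is not in general equal to the limiting optimal shrinkage parameter,

$$
\lambda_{js} = \lim \check{\lambda}_{js}^* = \frac{\rho}{\eta^2(1 - \rho)}.
$$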

This suggests that the asymptotic predictive risk of $\hat{\beta}_{bar}$ is suboptimal. Carrying this heuristic
a step further, we take the limit as $n \to \infty$ and $d/n \to \rho \in (0, 1)$, and utilize Proposition 4 to
obtain an expression for the asymptotic predictive risk of James-Stein type estimators with
an arbitrary shrinkage parameter $\lambda > 0$:

$$
R_{js}(\lambda, \rho, \eta^2) = \lim_{d/n \to \rho} R_V\{\hat{\beta}_{js}(\lambda)\} = \frac{\rho/(1 - \rho) + \lambda^2\eta^2}{(1 + \lambda)^2}.
$$

This yields the approximate inequality

$$
R_V(\check{\beta}_{js}^*) \approx R_{js}(\lambda_{js}, \rho, \eta^2) \le R_{js}(\lambda_{bar}, \rho, \eta^2) \approx R_V(\hat{\beta}_{bar}),
$$

where $\check{\beta}_{js}^*$ is the adaptive James-Stein estimator defined in Proposition 10 and the
approximation is valid for large $n$ and $d/n$ close to $\rho$. It should be noted that the inequality
$R_{js}(\lambda_{bar}, \rho, \eta^2) \ge R_{js}(\lambda_{js}, \rho, \eta^2)$ is strict unless $\lambda_{bar} = \lambda_{js}$. Moreover, though the equality
$\lambda_{bar} = \lambda_{js}$ may hold for some specific values of $c$, $\rho$, and $\eta^2$, in order for it to hold in general,
the constant $c$ from Baranchik's estimator must vary with $\rho$ and $\eta^2$.
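The comparison between Baranchik's limiting shrinkage parameter and the optimal one can be sketched numerically. The code below assumes $\rho \in (0, 1)$ and the limiting risk $\{\rho/(1-\rho) + \lambda^2\eta^2\}/(1+\lambda)^2$ for a James-Stein type estimator with shrinkage parameter $\lambda$, which follows from the decomposition in Proposition 4; the function names and the particular values of $c$, $\rho$, and $\eta^2$ are illustrative assumptions.

```python
def js_asymptotic_risk(lam, rho, eta2):
    """Limiting risk of OLS scaled by 1 / (1 + lam), for rho in (0, 1)."""
    return (rho / (1.0 - rho) + lam ** 2 * eta2) / (1.0 + lam) ** 2

def lambda_js(rho, eta2):
    """Limiting optimal shrinkage parameter."""
    return rho / (eta2 * (1.0 - rho))

def lambda_bar(c, rho, eta2):
    """Limiting shrinkage parameter of Baranchik's estimator with constant c."""
    return c * (1.0 - rho) / (eta2 + rho - c * (1.0 - rho))

def matching_c(rho, eta2):
    """The value of c for which lambda_bar coincides with lambda_js."""
    return (eta2 * rho + rho ** 2) / (eta2 * (1.0 - rho) ** 2 + rho * (1.0 - rho))

rho, eta2 = 0.5, 2.0
best = js_asymptotic_risk(lambda_js(rho, eta2), rho, eta2)
print(f"optimal: lambda={lambda_js(rho, eta2):.4f}  risk={best:.4f}")
for c in (0.1, 0.5, 1.0, matching_c(rho, eta2), 1.9):
    lam = lambda_bar(c, rho, eta2)
    print(f"c={c:.4f}  lambda_bar={lam:.4f}  risk={js_asymptotic_risk(lam, rho, eta2):.4f}")
```

In this example the risk at $\lambda_{bar}$ matches the optimal value only when $c$ equals `matching_c(rho, eta2)`, which depends on both $\rho$ and $\eta^2$.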

Some of the ideas from the previous discussion are made more rigorous in the next propo- 
sition, whose proof is omitted (part (a) is a straightforward calculation, the proof of (b) is 
similar to that of Proposition 10, and (c) follows directly from part (b) and Proposition 10). 

Proposition 12. 

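(a) Suppose $\rho \in (0, 1)$. Then

$$
R_{js}(\lambda_{js}, \rho, \eta^2) \le R_{js}(\lambda_{bar}, \rho, \eta^2)
$$

with equality if and only if

$$
c = \frac{\eta^2\rho + \rho^2}{\eta^2(1 - \rho)^2 + \rho(1 - \rho)}.
$$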



(b) Suppose that < ^ < rf/n < < 1 for some fixed constants ^, G K and that c is a 
positive constant satisfying < c < 2{d — 2)/{n — d + 2) for all n and d. Further suppose 
that n — d > 6 and let p — d/n. Then 

Rv{Kr)-Rv{Pis{\ar)} + 

(c) Under the assumptions of part (b) , 



(^2 + 1)^1/2 j • 



Rvi^bar) - RviPjs) = Rjs{Xbar,d/n,rf ) - Rjs{Xjs, d/ U, T]^) + O 



where is the adaptive James-Stein estimator from Proposition 10. □ 



L. Dicker/Linear Estimators and High-Dimensional Linear Models 






Proposition 12 and the preceding discussion imply that $\hat{\beta}_{bar}$ is suboptimal in terms of
predictive risk, even among the class of James-Stein estimators, $\hat{\beta}_{js}(\lambda)$. This naturally leads to
the question: are there other circumstances under which Baranchik's estimator is asymptotically
optimal among James-Stein estimators? The answer is affirmative. A straightforward
calculation (details omitted) implies that $\hat{\beta}_{bar}$ is asymptotically optimal among James-Stein
estimators for in-sample prediction, where the relevant risk function evaluated at an estimator
$\hat{\beta}$ is given by

$$
\frac{1}{n\sigma^2} E_V \|X\hat{\beta} - X\beta\|^2.
$$

8.3. The marginal estimator: Shrinkage 

In Section 5.5 and Proposition 7 we argued that the marginal estimator has significant 
drawbacks for out-of-sample prediction. One natural modification of the marginal estimator 
that could address some of these drawbacks is the marginal shrinkage estimator. 

Proposition 13 summarizes some properties of the marginal shrinkage estimator that are 
analogous to properties of the James-Stein and ridge estimators studied above. 

Proposition 13. 

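(a) Oracle estimator. Let $\lambda_m^* = d/(n\eta^2) + (d + 1)/n$ and let $\hat{\beta}_m^* = \hat{\beta}_m(\lambda_m^*)$. Then

$$
R_V(\hat{\beta}_m^*) = \inf_{\lambda \in [0, \infty]} R_V\{\hat{\beta}_m(\lambda)\} = \frac{\eta^2\{\eta^2(d + 1) + d\}}{\eta^2(n + d + 1) + d}.
$$

(b) Adaptive estimator. Suppose that $0 < \theta < d/n < \Theta < 1$ for some fixed $\theta, \Theta \in \mathbb{R}$ and
that $n - d > 4$. Let $\check{\lambda}_m^* = d/(n\hat{\eta}^2) + (d + 1)/n$ and define the adaptive shrinkage estimator
$\check{\beta}_m^* = \hat{\beta}_m(\check{\lambda}_m^*)$. Then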

□. 

The proof of Proposition 13 is omitted. It may be proved using the same techniques used to
prove the analogous results for the ridge and James-Stein estimators. Proposition 13 suggests
that the marginal shrinkage estimator still performs relatively poorly when $\eta^2$ is large. Indeed,

$$
\lim_{\eta^2 \to \infty} \frac{R_V(\hat{\beta}_m^*)}{R_V(\hat{\beta}_{ols})} > 1. \tag{21}
$$

In fact, the limit (21) is infinite if $d < n - 1$. Contrast (21) with (11), which states that the
corresponding limit for the oracle ridge and James-Stein estimators equals 1. Additionally,
note that the error term in part (b) of Proposition 13 is independent of $\eta$, while the
corresponding error terms in Proposition 10 for the adaptive ridge and James-Stein estimators are
proportional to $(1 + \eta^2)^{-1}$.

Proposition 13 (a) implies that the asymptotic predictive risk of $\hat{\beta}_m^*$ as $d/n \to \rho$ is

$$
\lim_{d/n \to \rho} R_V(\hat{\beta}_m^*) = \frac{\eta^2\rho(\eta^2 + 1)}{\eta^2(1 + \rho) + \rho} \ge R_r(\rho, \eta^2)
$$

for all $\rho$ and $\eta^2$, with equality if and only if $\eta^2 = 0$. In other words, the oracle ridge estimator
asymptotically dominates the oracle marginal shrinkage estimator. On the other hand, neither
the oracle marginal shrinkage estimator nor the oracle James-Stein estimator dominates the
other. This is made clear in the plots of asymptotic predictive risk found in Figures 2-3.
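The oracle shrinkage calculation for the marginal estimator can be illustrated with a short sketch. It assumes that the marginal shrinkage estimator takes the form $(1 + \lambda)^{-1} n^{-1}X^Ty$, that $\Sigma = I$ and $\sigma^2 = 1$, and that the unshrunk marginal estimator has risk $\{\eta^2(d + 1) + d\}/n$; under these assumptions the risk is a quadratic in the shrinkage factor and can be minimized in closed form. All function names and simulation settings are hypothetical.

```python
import numpy as np

def marginal_risk(n, d, eta2):
    """Risk of the unshrunk marginal estimator X^T y / n (Sigma = I, sigma^2 = 1)."""
    return (eta2 * (d + 1) + d) / n

def marginal_shrinkage_risk(lam, n, d, eta2):
    """Variance of the shrunk estimator plus the squared bias from shrinking."""
    a = 1.0 / (1.0 + lam)
    return a ** 2 * marginal_risk(n, d, eta2) + (1.0 - a) ** 2 * eta2

def oracle_lambda(n, d, eta2):
    return d / (n * eta2) + (d + 1) / n

def oracle_risk(n, d, eta2):
    return eta2 * (eta2 * (d + 1) + d) / (eta2 * (n + d + 1) + d)

n, d, eta2 = 50, 20, 2.0
grid = np.linspace(0.0, 5.0, 5001)
grid_risk = marginal_shrinkage_risk(grid, n, d, eta2)
print("grid minimizer:", grid[np.argmin(grid_risk)], " closed form:", oracle_lambda(n, d, eta2))
print("grid minimum:  ", grid_risk.min(), " closed form:", oracle_risk(n, d, eta2))

# Monte Carlo check of the unshrunk risk formula.
rng = np.random.default_rng(0)
beta = np.zeros(d)
beta[0] = np.sqrt(eta2)
mc_losses = []
for _ in range(4000):
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)
    mc_losses.append(np.sum((X.T @ y / n - beta) ** 2))
print("simulated:", np.mean(mc_losses), " formula:", marginal_risk(n, d, eta2))
```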

9. Discussion 

9.1. Adaptive estimators for d > n 

One limitation of the adaptive estimators considered in Sections 7 and 8 of this paper is the
requirement $d < n$. This requirement is related to the fact that if $d \ge n$, then the usual
estimator for $\sigma^2$, $\hat{\sigma}^2 = (n - d)^{-1}\|y - X\hat{\beta}_{ols}\|^2$, is undefined. By replacing $\hat{\sigma}^2$ in (17) with
an alternative estimator for $\sigma^2$ that is defined for all $d$, $n$, one could obtain an estimator for
$\eta^2$ that is defined for all $d$, $n$. This would immediately yield adaptive ridge and James-Stein
estimators for all $d$, $n$. However, estimating $\sigma^2$ in settings where $d > n$ is challenging. Recent
work by Fan et al. (2012) and Sun and Zhang (2011) has considered estimating $\sigma^2$ in settings
where $\sup d/n = \infty$, provided $\beta$ is sparse. We conjecture that even if $\beta$ does not satisfy
any sparsity conditions, it may be possible to effectively estimate $\sigma^2$ when $d > n$, provided
$\sup d/n < \infty$, and that an adaptive ridge or James-Stein estimator based on such an estimate
of $\sigma^2$ may have the same asymptotic predictive risk as the corresponding oracle estimator.

9.2. Conclusions 

Motivated by questions about non-sparse signals in high-dimensional data analysis, we studied
the predictive risk of the OLS, James-Stein, ridge, and marginal regression estimators in
high-dimensional linear models. Our analysis provides new, practical insights into the performance
of these popular methods, while making no assumptions about sparsity. Both the ridge and
James-Stein estimators may substantially outperform the OLS estimator in terms of predictive
risk, especially if the signal-to-noise ratio is small. Additionally, the ridge estimator studied
here leverages the population predictor covariance matrix $\Sigma$ to obtain further improvements in
out-of-sample prediction when compared to the James-Stein estimator. These improvements
may be precisely quantified by our formulas for asymptotic predictive risk. On the other
hand, we also showed that the marginal regression estimator has substantial drawbacks for
out-of-sample prediction.
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Appendices 

Appendix A: Required lemmas 

Lemma Al. Let > denote the smallest eigenvalue of rT^X'^X. Suppose that A; > is 
fixed and that n — d > 2k Then 



Proof. If = 1, then it is easy to check that the lemma is true. Thus, suppose that d > 2. 
Muirhead (1982) gives the joint density of the ordered eigenvalues, si > • • • > > 0, of 




(22) 



n-^X'^X: 




where 



(2/n)'^-/2r,(d/2)r,(n/2) 



and 



d 



r,{n/2) = n"^'-'^/' l[r{{n-j + l)/2} 
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is the multivariate gamma function. From this it follows that 

Eiis-^"") < Eiis^''; sa<c)+ c'^ < c^--<^-m-^ ^I^E {dei{n-^ Zf") + (23) 

Cn,(i— 1 

for any c > 0, where Zisannx((i — 1) dimensional matrix of iid standard normal random 
variables. It is easy to check that 



n/2 



c„,d_i r{(n-ci+l)/2}r(ci/2)" 
Additionally, it is well known (Exercise 3.11 in Muirhead (1982), for instance) that 

£;{det(n-Z^Z)Vn = (2/n)(-^-^)/^^'^-^^^^+,^f> = (2/n)(^-)/^^i^^^±i4rT 
^ ^ W V/; r,_i(n/2) r{(n-d + 2)/2} 

Thus, using Stirling's approximation, we obtain 

^"'^ -E {det(n-Z-Z) . V^(n/2)(— )/^r{(. + l)/2} 



cm-1 ' ' r{(n-ci + l)/2}r{(n-d + 2)/2}r(d/2) 

(2n)("-'^+^)/'r{(n + l)/2} 
2T{n-d+l)T{d/2) 



87r(n — d) \n — d J \dJ 



- ''87r(n-d) 



and, by (23), 



87r(n — (i) 

r , 1 -2/(n-d-l) 

Taking c = | ynd/{87r(n - d)}e"+2 } gives (22) . □ 



Lemma A2. Let $s_1 \ge s_d \ge 0$ denote the largest and smallest eigenvalues of $n^{-1}X^TX$,
respectively. Suppose that $k > 0$ is fixed and that $0 < d/n < \Theta < 1$ for some fixed constant
$\Theta \in \mathbb{R}$.

(a) $E_I(s_1^k) = O(1)$.

(b) If $n - d > 2k + 1$, then $E_I(s_d^{-k}) = O(1)$.

Proof. Part (a) is well known and may be easily derived from large deviations results for $s_1$
(see, for example, Theorem II.13 of Davidson and Szarek (2001)). Part (b) follows directly
from Lemma A1. □

Lemma A3. Let A; be a fixed positive integer and suppose that 0<d/n<©<lfor some 
fixed constant © e R. Then 

Ev{ \, \ =0 



Tf-\-d/nj I \rf-\-d/n 



Proof. Notice that 



I \^ ^ \ 1 



Ev < E^ 



ff + 2d/nj - [||X^,;,||V(n(T2) + rf/n 
< Ev 



k 



\\X(3,,,\\^/{na^)-rd/n 

2 \ 



= Ev\'^ + l\ Ev 



J [||X/3„,J|V(n<72) + d/n 

and that Evia'^/a'^ + 1)*^ = 0(1). Since it is clear that Ev {ff + d/n)~'^ = 0{n/d), it suffices 
to show that ^ 

Ev\ . ^ 1 =0{rj-^^). 

Conditional on X, | p/a^ follows a noncentral distribution with d degrees of freedom 

and noncentrality parameter \\Xj3\\^/a'^. Thus \\Xl3oiJ\f' /a"^ has the same distribution as a 
central random variable with 2N + d degrees of freedom, where N\X ~ Poisson(C) and 
C = I |X/3| 12/(20-2). Now let W ~ be an independent random variable with m = 
{2k — d + 2) V 1 degrees of freedom. Then, since the A;-th inverse moment of a random 
variable with / degrees of freedom is 2~'^r(//2 — A;)/r(//2), provided / > 2k, it follows from 
Jensen's inequality that 

k f 

1 



Ev \ / ^ Ev 



X^„,,||V(n(j2) + d/n J [ ||X^„,J|V(n(j2) + dl{mn)W 



d J \\\X(3,i,\\ya^ + W^ 
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n{m + d) 



< 



< 



2d 




n{m + 


d) 


2d 




n{rn + 


d) 


2d 




n{m + 


d) 


2d 



E 



T{N +{d + m)/2 - k\ 
V{N + {d + m)/2} 



k oo 



fJ^N + {d + m)/2-i 



3=0 



EviC 



V 



-k\ 



iJ + k)l 



(24) 



where we have used the fact that C/^^ ~ Xn ^ distribution with n degrees of freedom. 
The lemma follows. □ 

Lemma A4. Suppose that < d/n < © < 1 for some fixed constant © e M and let A; > 
be fixed. If n > 2k, then 



Pvir = 0) - O 



_d^ 



Proof. Let U = \\X f3„i^\\y and let W = ||y-X/3„,J|V(j2 = (n-p)aya\ Then W ~ xl-d 
has a distribution with n — d degrees of freedom and, conditional on X, U ~ Xyx/sip/o-^ d 
has a noncentral x'^ distribution with noncentrality parameter ||X/3||^/a"^ and d degrees of 
freedom. Furthermore, U and W are independent and 



V 



d 



U/d 



n {W/{n-d) 



- 1 Wo. 



Thus, 



Pv{ri=0) = Pv 

< Ev exp 
1 



d n — d 



n — d 

{n-d)/2 



W El 



I — 2r 

n—d 



1 + f 



d/2 



n — d 



n — d — 2r 



{n-d)/2 



d 



d + 2r 



Ev exp 

d/2 



\\X(3\\ya' 



d + 2r 

d + 2r 
d + 2(r/2 + l)r 



n/2 
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< exp 



2r n 



d + 2r 



{n- d-2r){d + 2r) j {d + 2{r]^ + l)r 
provided r < {n — d)/2. Now, basic calculus implies that 



n/2 



^ + 2r Y'' , \ {d + 2r)k Y 
;.To'' ld + 2(r72 + l)r/ \r{n-2k)] ' 



sup T) 



2k 



The lemma follows by taking r = aVd for a > sufficiently small. 



□ 



Lemma A5. Suppose that < d/n < © < 1 for some fixed constant © e R and that /c > 
is fixed. If n — o? > 2k, then 



Ev\f-rf\^^0 



d/n + T) 
n 



+ ' 



n 



k/2 ( ■ 



Proof. Using Lemma A4, we have 



Ev\rf-rf\^ < 



\\Xkis\? 



= E^ 



V 



11^/3, 



ols 



- + r)^ 
n 



- + r}' 
n 



+ rj^'Pvir = 0) 



+ 



d^\ 
J 



(25) 



Since {n — d)a'^ = ||y — -'^ySoislP ll^/^o/sIP independent, 



E,, 



ols I 



n 



< 2^E^, 



W^f^olsW^ 12 



d\ a' 



+2^ T)^ + 



d 



n 



Ev 



a 



a" 



a 



2\ k 



TEv \-^\ E 



d 



\\X(3ois\? fo , d 

9 I ' 



+2"' ( + - ) 
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As in the proof of Lemma A3, let N ~ Poisson{| |X/3| |7(2cr2)}. Since \ \X^^^i^\\^ / a"^ ~ Xw+d^ 
we have 



V 



9 I ' 



< 2^Ev 



n 



V 



2N 



n 



n 



k/2' 



AdditionaUy, one can check that 



a 



- 1 



n 



-k/2 



It follows that 



E^ 



V 



- + ri' 
n 



O 



TV 



k/2 



The lemma follows by combining this with (25). 



□ 



Appendix B: Proofs of propositions contained in the main text 



Proof of Proposition 1. Suppose that /3 is linearly equivariant. To prove part (a), observe 
that 



a^Rvm = Ey[^{y,X,E)-f3] S [p{y,X, S) - (i] 
= Ey\\^{y,XS-"\l)-S'/^f3\^ 
= Ev,\\(3{y,XJ)-f3{V^)\^ 

where /3(Vo) = ^^^^/3, and wc have used linear equivariance of y3, along with the fact that if 
Xi ~ A^(0, E), then S'^l'^yii ~ N{Q, I). 
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Now suppose that ^6 is LS, let u e M*^ be a unit vector, and \ei U he & d x d orthogonal 
matrix such that US^/'^f3/a = ryu = (3{V^). Then 



Rvm 



Vb 



;3 - 17^2/3 



a-'Uf3{y,XJ)-'nv, 



^{XU^r]\i + e/cr, XU^, I) - r]U 



^iy,X,I)-f3iV^) 



□ 



= RvM- 

Proof of Proposition 2. Proposition 1 and a simple bias-variance decomposition lead to

$$
R_V(\hat{\beta}_{ols}) = R_{V_u}(\hat{\beta}_{ols}) = \eta^2 E_I \left\| \{(X^TX)^{-}X^TX - I\}u \right\|^2 + E_I \operatorname{tr}\{(X^TX)^{-}\}.
$$

If $d \le n$, then the first term on the right-hand side above is equal to 0; if $d > n$, then, by
symmetry, it is equal to $\eta^2(d - n)/d$. Using properties of the inverse Wishart distribution (see
Chapter 3 of (Muirhead, 1982), for instance) the second term on the right-hand side is equal
to $d/(n - d - 1)$, if $d < n - 1$; it is equal to $n/(d - n - 1)$ if $d > n + 1$; and it is infinite
otherwise. The proposition follows. □
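The inverse Wishart moments used in this proof can be checked by simulation. The sketch below estimates $E_I \operatorname{tr}\{(X^TX)^{-}\}$ by Monte Carlo, using the fact that $X^TX$ and $XX^T$ share the same nonzero eigenvalues, and compares it with $d/(n - d - 1)$ and $n/(d - n - 1)$. It assumes $\Sigma = I$ and Gaussian predictors; the function names, dimensions, and number of replications are illustrative assumptions.

```python
import numpy as np

def trace_pinv_mc(n, d, reps, rng):
    """Monte Carlo estimate of E tr{(X^T X)^-}, summing inverse nonzero eigenvalues."""
    draws = []
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        # Use the smaller Gram matrix, which is full rank with probability one.
        gram = X.T @ X if d <= n else X @ X.T
        draws.append(np.sum(1.0 / np.linalg.eigvalsh(gram)))
    return float(np.mean(draws))

def ols_risk(n, d, eta2):
    """Closed form described in the proof (infinite when |n - d| <= 1)."""
    if d < n - 1:
        return d / (n - d - 1)
    if d > n + 1:
        return eta2 * (d - n) / d + n / (d - n - 1)
    return float("inf")

rng = np.random.default_rng(0)
print("d < n:", trace_pinv_mc(60, 20, 2000, rng), "vs", ols_risk(60, 20, 1.0))
# For d > n the trace term is the risk minus the bias term eta2 * (d - n) / d.
print("d > n:", trace_pinv_mc(20, 60, 2000, rng), "vs", ols_risk(20, 60, 1.0) - 40 / 60)
```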



Proof of Proposition 3. Fix a unit vector u = Ud)^ e M*^. Then 



Rv{(3^) = Ev^\n-'X'y-r,u 



d 



n 



Xjy - r]Uj 



(26) 



where Xj is the j-th column of X. Considering each term in the summation above separately, 
we have 



E^ 



1 nr 

-X . y - r]Uj 
n ■' 



y = ^v„|(^XjX,-l)ry«, + igxJX,,7x.,+ ixj6| 



rjW.Ei 



n 



XJX, 
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n 



n ^ " n 



n 



The proposition follows by summing over j above and using the identity (26). 
Proof of Proposition 4- It is easy to check that 



□ 



where 



Ji = Eitr{{X^X)-} 



1 



1 + A 



Ji 



A 



1 + A 



J2 + J3- 



— ^ a d < n- 1 

n—d—l 

if > n + 1 

d—n—l 



r]'Ei\\{X' XyX' Xu 



00 if p G {n — 1, n, n + 1} 

rf ii d <n 
rf^ \id> n 



,^E,\\{i-ix-x)-x-x]^\\: = { I'^ii 



□ 



Hence, (8). The rest of the proposition follows by basic calculus. 

Proof of Proposition 5. For j = 1, let e denote the j-th standard basis vector. 
Fix A e [0,00]. Then 

RviKm = £;y.J|^,(A)-/3||' 

= Ev,^ I \'nn\{X^X + nXiy^ej | f + Ey,^ \ \ {X^X + nXI)''^ X^ e\\^ 

= Ei I \r]n\{X'^X + n\iy^ej\\^ + E/tr {(X^X + nXiy'^X'^X] . 
Summing over j = 1, d above gives 



1 1 

7] n 2 



d 



X^EitT{{X^X + nXI)-^} + Eiti {{X^X + nXiy^X^X} 



and (9) follows. 

To prove (10), let Si > • • • > denote the eigenvalues of n'^X'^X. Then 
Rv0rW} = El i^Y.{ns, + nX)-' (^ns, + ^a) | . 




It is easy to check that each of the d summands on the right-hand side above is minimized 
by taking A = and that Rv{K) = Eiti{X^X + nA;/)-^ □ 

Proof of Proposition 6. The inequahty Rv{/3js) < Rv{(^ois) discussed above and follows 
from Propositions 2 and 4. The other inequahty follows from Jensen's inequality. As in the 
proof of Proposition 5, let Si > • • • > > denote the eigenvalues of rT^X'^X. First suppose 
that d<n-l. Then 

Rv{K) = Eitr{X^X + nX;i)-^ 




< o EMX^X)-^ 
- ' rj^ + Eitr{X'^X)-^ 
rj^d 



rf{n — d — 1) -\- d 



where the inequality is strict unless 77^ = 0. Ifd>n-|-1, then = s„+2 — ■ ■ • — sa — ^ and 
a similar calculation implies that Rv{l3r) < ^vil^js)- Finally, if d e {n — l,n,n-\- 1}, then it 
is clear that Rv{K) <'rf ^ Rv{^]s)- ^ 

Proof of Proposition 8. Let '¥n,d be the empirical cumulative distribution function of the 
eigenvalues of n'^X'^X. Using integration by parts, for c > 0, 

^tr(X^X + nA;/)-i = / -^d¥^,,{s) 

" Jo S + 



Similarly, 
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[ 



-{l-Fa/n{s)} ds. 



35 
(28) 



Now let A = \Rv{(ir) — {d/n)md/n{—^*r)\- Then Theorem 3.1 of Bai (1993) (see equation 
(14) in Section 6.1 above) and the inequahties 



A < - 

n 



1 







(s + A*)2 

which follow from (27) and (28) , imply 



EI¥r^,d{s) - Fa/n{s)\ ds < r^^sup \Ei¥n,d{s) - Fa/n{s)\ , 



s>Q 



A- 



0(^2^-1/4) ifo<^<e<iori<^<e<oo, 

(5(^2^-5/48^ if0<^<l<e<OO. 



Part (b) of the proposition follows immediately. 

To prove part (a) of the proposition, we show that, in fact, A = 0{n~^^^) ifO<^^<G<l 
or 1 < 6^ < 6 < oo. First suppose that < 6^ < 6 < 1. Then, for c < (1 - ^Jdjnf, 



md/n{-K) 



c + Xt 



(s + a;)2 



{1 - Fd/n{s)} ds 



and 



A < Ei 



s + Xl 



c + A; 



{s + a;)2 



{Ei¥n,d(s) - Fd/n(s)} ds 



< / ^-^M +^^.f„,(c) + ^ 



sup \Ei¥n^d{s) - Fd/n{s)\ 



< Ei{s/; Sd<c) + 



< {E,{sf)y^' P^isd < c)i/2 + c-^Pi{sd <c) + sup \Ej¥^,d{s) - F,/„(s)| , 



-^-^Pl{Sd < C) + ^SUp \Ei¥n,d{s) - Fd/n{s)\ 

C + A„ C + A„ s>c 



s>c 



where $s_d \ge 0$ is the smallest eigenvalue of $n^{-1}X^TX$. We bound the first two terms and the
last term on the right-hand side above separately. Bounding the first two terms relies on a result
of Davidson and Szarek (2001). Their Theorem II.13, which is a consequence of concentration
of measure, implies that
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provided c < 1 — d/n. Additionally, Lemma A2 in Appendix A implies that Ei[s^ = 0(1) 
if n - d > 5. Taking c = (1 - it follows that 

{Ej{sf)f' P{s, < cy/' + c-'P{s, < c) = 0{n-'/') 

(in fact, we can conclude that the quantities on the left above decay exponentially, but this 
is not required for the current proposition). It now follows from Theorem 3.1 of Bai (1993) 
that A = 0(n-i/^). For the case where 1 < ^ < G, we note that the same argument as above 
applies to XX'^, which has the same nonzero eigenvalues as X^X. Part (a) of the proposition 
follows. □ 

Proof of Proposition 9. The first statement is easily verified. To prove the second statement,
Proposition 6 implies that it suffices to show

$$
\liminf \frac{R_V(\hat{\beta}_r^*)}{R_V(\hat{\beta}_{js}^*)} \ge 1
$$

for $c \in [0, \infty]$. But this follows from Jensen's inequality, which implies

$$
\frac{d/n}{1 + d/(n\eta^2)} \le E_I \operatorname{tr}\{X^TX + (d/\eta^2)I\}^{-1} = R_V(\hat{\beta}_r^*).
$$



□ 



Proof of Proposition 10. Assume that < 6 < d/n < Q < 1 for some fixed constants 
6', G M and that n — o? > 9. We prove (a). The proof of (b) is entirely similar. Since and 
(3^ are LS, 

RviK) = RvAK) 

= Ey^\\{X^X + nXlI)-'X^y-r^u\\' 

' 'I d 



n 



u 



d 



n^ 



n^ 



2rj'^Ev^e' X I rj^-X^X + -I 



d 



n 



n 



u 



r^^r]^-(X^X + -l) X^e 
n n 
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and 



-I 2^V^ 



-1 



rj^}LX^X + -I\ u 
n n 



if ( i^^-X'^X + -/ ) X' e 
n n 
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n n 



u 



n 



H — ^-^Vu 



n 
1 



u 



n n 



where u e M'^ is a fixed unit vector. Thus, 



\Rv{K) - Rv{K)\ < \EvM + \EvM + 2 \EvM , 



(30) 



where 
Hi 



9 d 

V - 
n 



Ho ^ ^ 



n n I 



rf[rf-X'^X + -I] X'e 
n n 



rflx''X + -l\ \ 
n n I 



rf[rf-X^X + -I] X'e 
n n 



d 



r]fj'—e'Xlfj'-X'X + -n u. 



n 



n 



We consider the terms \Ev^Hi\, \Ev^H2\, and lEv^H^] separately. 

Let $s_1 \ge \cdots \ge s_d \ge 0$ denote the ordered eigenvalues of $n^{-1}X^TX$ and let $U$ be a $d \times d$ orthogonal 
matrix such that $S = n^{-1}U^TX^TXU$ is diagonal. Additionally, let $\tilde{u} = (\tilde{u}_1, \ldots, \tilde{u}_d)^T = U^Tu$ 
and let $\delta = (\delta_1, \ldots, \delta_d)^T = U^T(X^TX)^{-1/2}X^T\epsilon$. Then 
n 



u]sj(rf - if ) 



Y- 

{fj'^Sj + d/n){r]'^Sj + d/n) Xfj^Sj + d/n rj'^Sj + d/ 



< 



< 



n ~^{ff + d/n){rf -\- d/n) \ff-\-d/n ' rf-\-d/ 



1 1 
+ 



In 



-.2 "^i 



8(d/n)|772-,)2i 



1 1 

+ 



and 



if -\- d/n \f]'^ + d/n rf-\-d/nJ 
^ \ {fj% + d/nf {rj\ + d/nf 



1 

— + si 



< 



< - 

n 



1 
n 

1 

n 

4 
n 

4,,~„ 



— ^ ('V^'Sj + d/n){rfsj + rf/n) \57^Sj + d/n rfsj + li/ 



+ 



In 



5'j\'if - rf 



Y- 

^ {ff + d/n){rf + (i/n) 



+ S. 



- + Si . 



{ff + d/n){rf + d/n) \Sd 
Thus, by Lemmas A2, A3, A5, and the Cauchy-Schwarz inequality, 

\[ |E_{V_u} H_1| + |E_{V_u} H_2| = O\left( \frac{1}{\sqrt{n}(1 + \eta^2)} \right). \qquad (31) \]

To bound $|E_{V_u} H_3|$, we use integration by parts (Stein's lemma): 



d 



V- 



n 



3/2 



-2 l/2r ~ 
57 S - OjM 



i "J "-J 



(7725^. + d/ny^ 



d 



d ^2 1/2 c ~ 



^ (r^^Sj + d/n)'^ 



(r^^Sj - d/n){^/n1]SJ^'^Uj + 6j)sj'^6jU 



.1/23 



Thus, 



a'^{ffsj + (i/n)^ 

d / ^ 1/2 ~ , ? ^ 1/2? ~ 



a'^{ffsj -\- d/ny 






3/2 ^ 



+877 



1 



-E- 



^5/2-^" ^ ^2(^2 + ^/^)2 1^ ^3/2 + 



1/2 



1 



^/n{l + 7]^) J 

Combining this with (30) and (31) completes the proof of the proposition. □ 
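
The integration-by-parts step relies on Stein's lemma: if $\delta \sim N(0, \sigma^2)$ and $g$ is sufficiently smooth, then $E\{\delta g(\delta)\} = \sigma^2 E\{g'(\delta)\}$. The sketch below checks this identity by Monte Carlo for an arbitrary smooth test function; the choice $g = \tanh$, the value of `sigma`, the sample size and the seed are assumptions for illustration and do not correspond to the specific integrand in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.5  # hypothetical noise standard deviation
delta = sigma * rng.standard_normal(2_000_000)


def g(x):
    """Smooth, bounded test function (hypothetical choice)."""
    return np.tanh(x)


def g_prime(x):
    """Derivative of g: d/dx tanh(x) = 1 / cosh(x)^2."""
    return 1.0 / np.cosh(x) ** 2


lhs = np.mean(delta * g(delta))          # estimate of E[delta * g(delta)]
rhs = sigma**2 * np.mean(g_prime(delta))  # estimate of sigma^2 * E[g'(delta)]

print(lhs, rhs)
```

The two estimates agree up to Monte Carlo error, which is the property that lets a cross term in the noise be traded for a derivative term.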

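Proof of Proposition 11. Suppose that $0 < \theta < d/n < \Theta < 1$ for some fixed constants 
$\theta, \Theta \in \mathbb{R}$. Let $u \in \mathbb{R}^d$ be a unit vector and suppose that $\sigma^{-1}U\Sigma^{1/2}\beta = \eta u$, where $U$ is 
a $d \times d$ orthogonal matrix. Define $\tilde{\Sigma} = \tilde{\Sigma}(y, X) = U\Sigma^{-1/2}\hat{\Sigma}(\sigma y, XU\Sigma^{1/2})\Sigma^{-1/2}U^T$ and let 
$\hat{\beta}(\lambda^*, \tilde{\Sigma}) = (X^TX + n\lambda^*\tilde{\Sigma})^{-1}X^Ty$ be the adaptive ridge estimator defined in Proposition 
10. Then 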

\[ R_V\{\hat{\beta}(\lambda^*, \hat{\Sigma})\} = R_{V_u}\{\hat{\beta}(\lambda^*, \tilde{\Sigma})\} \]

and Proposition 10 implies that it suffices to show 
Now notice that 

Rv^[~^{x;,s)}-Rv^[~^{x;,i) 



3(A;,r)-3(A;,/) 



Considering integrands from the terms on the right-hand side above separately, we have 

f,^d\T- 



mK,E)\\'-mK.i) 



n 



n 



^-X^X + ^E 



n 



n 



rr 



tx^X + 
n n 
-X'^Xil - r) + (/ - S)-X^X 
n n 



^X^X + -E 



n n 

2 



< 4 + - 

n 



d\ |^AJI>| 



< 



n — d 



Sl + 



d 



nJ St: 



|y - 



\I -s\ 



where $s_1 \ge \cdots \ge s_d$ are the eigenvalues of $n^{-1}X^TX$. A similar calculation yields 



||;3(A:,r)-^(A;,7)||< 



(32) 



(33) 



The proposition follows by taking expectations in (32)-(33) and applying Lemmas A2 and 
A3, along with the Cauchy-Schwarz inequality. □ 
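
One standard way to see why the difference between a ridge estimator with penalty matrix $\tilde{\Sigma}$ and one with penalty matrix $I$ is controlled by $\|I - \tilde{\Sigma}\|$ is the resolvent identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$. The sketch below illustrates it numerically. It is not claimed to be the exact calculation behind (32)-(33); the penalty level `lam`, the perturbed matrix `Sigma_t`, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 60, 10, 0.3  # hypothetical sizes and penalty level
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Hypothetical covariance estimate: a small symmetric perturbation of I.
E = 0.05 * rng.standard_normal((d, d))
Sigma_t = np.eye(d) + (E + E.T) / 2

A = X.T @ X + n * lam * Sigma_t    # penalty matrix Sigma_t
B = X.T @ X + n * lam * np.eye(d)  # penalty matrix I
beta_t = np.linalg.solve(A, X.T @ y)
beta_i = np.linalg.solve(B, X.T @ y)

# Resolvent identity with B - A = n * lam * (I - Sigma_t):
# beta_t - beta_i = n * lam * A^{-1} (I - Sigma_t) beta_i.
diff = beta_t - beta_i
via_identity = n * lam * np.linalg.solve(A, (np.eye(d) - Sigma_t) @ beta_i)


def op_norm(M):
    """Spectral (operator) norm of a matrix."""
    return np.linalg.norm(M, 2)


# Submultiplicativity then bounds the difference by ||I - Sigma_t||.
bound = n * lam * op_norm(np.linalg.inv(A)) * op_norm(np.eye(d) - Sigma_t) * np.linalg.norm(beta_i)

print(np.linalg.norm(diff), bound)
```

In this form, the eigenvalues of $n^{-1}X^TX$ enter through $\|A^{-1}\|$ and the size of $\hat{\beta}(\lambda^*, I)$, while the covariance estimation error enters only through $\|I - \tilde{\Sigma}\|$.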